Rhonda Ho Kah Yee
Published: March 9, 2023
Modified: March 26, 2023

pacman::p_load(readxl, sf, tidyverse, units, matrixStats, olsrr, gdata, tmap,
               sfdep, jsonlite, onemapsgapi, rvest, ranger, SpatialML,
               GWmodel, httr, cowplot, Metrics)
Hello! This is Rhonda Ho’s take-home Assignment 3 for IS415 module.
To view/hide all the code at once, please click on the “</> code” tab beside the title of this html document and select the option to view/hide the code.
The full details of this assignment can be found here.
In this take-home exercise, I am tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room flats) for January and February 2023 in Singapore. The predictive models must be built using the conventional OLS method and geographically weighted regression (GWR) methods. I am also required to compare the performance of the conventional OLS method against the geographically weighted methods.
Below is the list of datasets I have collected.
| Type | Dataset | File Format |
|---|---|---|
| Aspatial | Resale Flat Prices | .csv |
| Geospatial | Master Plan 2014 Subzone Boundary (Web) | .kml |
| Geospatial | Bus Stop Location | .kml .shp |
| Geospatial | Train Station | .kml .shp |
| Geospatial | School Directory and Information | .csv |
| Geospatial-Extracted | Child Care Services | .shp |
| Geospatial-Extracted | Eldercare Services | .rds |
| Geospatial-Extracted | Hawker Centres | .rds |
| Geospatial-Extracted | Kindergartens | .rds |
| Geospatial-Extracted | Parks | .rds |
| Geospatial | Supermarket | .geojson |
| Geospatial | Shopping Malls | .csv |
The following packages will be used:
sf: used for importing, managing, and processing geospatial data
tidyverse: a collection of packages for data science tasks
tmap: used for creating thematic maps, such as choropleth and bubble maps
sfdep: used to create spatial weights matrix objects, global and local spatial autocorrelation statistics and related calculations (e.g. spatially lag attributes)
onemapsgapi: used to query Singapore-specific spatial data, alongside additional functionalities.
units: used for manipulating numeric vectors that have physical measurement units associated with them
matrixStats: a set of high-performing functions for operating on rows and columns of matrices
jsonlite: a JSON parser that can convert from JSON to the appropriate R data types
olsrr: used for building least squares regression models
httr : used to make API calls, such as a GET request
ggpubr: used for multivariate data visualisation & analysis
GWmodel: provides a collection of localised spatial statistical methods, such as summary statistics, principal components analysis, discriminant analysis and various forms of GW regression
SpatialML: allows for geographically weighted random forest regression, including a function to find the optimal bandwidth
ranger: a fast implementation of Random Forests, particularly suited for high dimensional data
cowplot: a simple add-on to ggplot and provides various features that help with creating publication-quality figures
Metrics: used to calculate the RMSE of our models
Before I can perform my tasks, I need to obtain the following appropriate independent variables from my datasets.
Structural factors:
Area of the unit
Floor level
Remaining lease
Age of the unit
Locational factors:
Proximity to CBD
Proximity to eldercare
Proximity to foodcourt/hawker centres
Proximity to MRT
Proximity to park
Proximity to good primary school
Proximity to shopping mall
Proximity to supermarket
Numbers of kindergartens within 350m
Numbers of childcare centres within 350m
Numbers of bus stop within 350m
Numbers of primary school within 1km
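Most of these locational variables are nearest-distance or count-within-radius measures. A minimal sketch with sf (object and helper names are my own; each layer is assumed to be a POINT sf object in EPSG:3414, so distances are in metres):

```r
library(sf)

# Distance from every flat to its nearest amenity, in metres
nearest_dist <- function(flats_sf, amenity_sf) {
  d <- st_distance(flats_sf, amenity_sf)   # pairwise distance matrix
  as.numeric(apply(d, 1, min))             # row minimum = nearest amenity
}

# Number of amenities within `radius` metres of each flat
count_within <- function(flats_sf, amenity_sf, radius) {
  lengths(st_is_within_distance(flats_sf, amenity_sf, dist = radius))
}

# e.g. PROX_PARK   <- nearest_dist(resale_sf, nationalparks_sf)
#      NUM_BUSSTOP <- count_within(resale_sf, bus_stop_sf, 350)
```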
Rows: 149,071
Columns: 11
$ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…
From the glimpse of our data above, we can observe that the dataset has 149,071 rows and 11 columns: month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease and resale_price.
For this assignment, the dataset will focus on:
Transaction period: Oct 2022 to February 2023
Training dataset period: Oct 2022 to December 2022
Test dataset period: January 2023 to February 2023
Flat type: 5-room flats
The code chunk filters our dataset accordingly.
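A sketch of that filter (column names from the glimpse output; the helper names are my own — month is a "YYYY-MM" string, so lexicographic comparison is chronological):

```r
library(dplyr)

# Keep 5-room transactions in the Oct 2022 - Feb 2023 window
filter_resale <- function(df) {
  df %>%
    filter(flat_type == "5 ROOM",
           month >= "2022-10", month <= "2023-02")
}

# Split by transaction month into the training and test windows
split_train_test <- function(df) {
  list(train = df %>% filter(month <= "2022-12"),
       test  = df %>% filter(month >= "2023-01"))
}
```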
Based on my senior’s experience, “ST.” is usually written as “SAINT” instead - for example, St. Luke’s Primary School is written as Saint Luke’s Primary School. To address this, we’ll replace such occurrences:
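A minimal sketch of that replacement (the helper name is mine):

```r
library(stringr)

# Rewrite "ST." as "SAINT" so that OneMap matches names
# like St. Luke's Primary School
fix_saint <- function(x) {
  str_replace_all(x, fixed("ST."), "SAINT")
}
```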
Subsequently, as our dataset is missing the coordinates for the location, I created a function, geocode(), which calls onemapAPI to retrieve the geometry of each location.
geocode <- function(block, streetname) {
  base_url <- "https://developers.onemap.sg/commonapi/search"
  address <- paste(block, streetname, sep = " ")
  query <- list("searchVal" = address,
                "returnGeom" = "Y",
                "getAddrDetails" = "N",
                "pageNum" = "1")

  res <- GET(base_url, query = query)
  restext <- content(res, as = "text")

  output <- fromJSON(restext) %>%
    as.data.frame %>%
    select(results.LATITUDE, results.LONGITUDE)

  return(output)
}
While exploring the API, I realised that the best search parameter is the column block combined with its street_name. Thus, the function geocode() takes in both the block and the street name.
To reuse the resale dataset, without calling the API, I saved it into a rds file.
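The save-to-rds pattern used throughout can be wrapped in a small caching helper (a sketch; the helper name and usage are my own): compute once, save, and read the file on later runs.

```r
# Run `make()` only if the rds file does not exist yet; otherwise
# read the cached result back instead of re-calling the API
cache_rds <- function(path, make) {
  if (file.exists(path)) return(readRDS(path))
  val <- make()
  saveRDS(val, path)
  val
}

# e.g. resale_geo <- cache_rds("data/rds/resale.rds",
#                              function() geocode_all(resale))
```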
In this section, I will be pre-processing the structural data that I need for my tasks.
Let’s first take a look at the storey_range values.
[1] "01 TO 03" "04 TO 06" "07 TO 09" "10 TO 12" "13 TO 15" "16 TO 18"
[7] "19 TO 21" "22 TO 24" "25 TO 27" "28 TO 30" "31 TO 33" "34 TO 36"
[13] "37 TO 39" "40 TO 42"
We can see that there are 14 storey range categories. The following code chunk recodes the categorical naming to numerical values by assigning 1 to the first range 01 TO 03 and 14 to the last range 40 TO 42.
storey storey_order
1 01 TO 03 1
2 04 TO 06 2
3 07 TO 09 3
4 10 TO 12 4
5 13 TO 15 5
6 16 TO 18 6
7 19 TO 21 7
8 22 TO 24 8
9 25 TO 27 9
10 28 TO 30 10
11 31 TO 33 11
12 34 TO 36 12
13 37 TO 39 13
14 40 TO 42 14
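The recode shown above can be sketched as (the helper name is mine; sort(unique(...)) orders the ranges ascending, so seq_along() numbers them from 1 for "01 TO 03"):

```r
# Build the storey_range -> storey_order lookup table
storey_lookup <- function(storey_ranges) {
  storey <- sort(unique(storey_ranges))
  data.frame(storey = storey, storey_order = seq_along(storey))
}
```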
Next, I will join it to the original resale df.
Currently, remaining_lease is in string format but we need it to be numeric. Thus, we need to split the string into its year and month components and replace the original values with the remaining lease calculated in years.
# e.g. of resale$remaining_lease: "53 years 10 months"
str_list <- str_split(resale_rds$remaining_lease, " ")
# after splitting: c("53", "years", "10", "months")

for (i in 1:length(str_list)) {
  if (length(unlist(str_list[i])) > 2) {
    year <- as.numeric(unlist(str_list[i])[1])
    month <- as.numeric(unlist(str_list[i])[3])
    resale_rds$remaining_lease[i] <- year + round(month/12, 2)
  } else {
    year <- as.numeric(unlist(str_list[i])[1])
    resale_rds$remaining_lease[i] <- year
  }
}
To get the age of the unit, I take the current year, i.e. 2023, and subtract the lease commence year of the unit.
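For reference, the same conversion can be written vectorised, together with the age calculation (a sketch; the helper names are mine):

```r
# "53 years 10 months" -> 53.83, "61 years" -> 61
lease_to_years <- function(lease_str) {
  sapply(strsplit(lease_str, " "), function(p) {
    yrs <- as.numeric(p[1])
    if (length(p) > 2) yrs + round(as.numeric(p[3]) / 12, 2) else yrs
  })
}

# Unit age is the current year minus the lease commence year
unit_age <- function(lease_commence_year, current_year = 2023) {
  current_year - lease_commence_year
}
```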
Finally, we can convert it into an sf object.
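A sketch of that conversion (column names as returned by geocode(); the helper name is mine): OneMap returns lat/long in WGS84 (EPSG:4326), which we then project to SVY21 / Singapore TM (EPSG:3414).

```r
library(sf)

to_sf_3414 <- function(df) {
  pts <- st_as_sf(df,
                  coords = c("results.LONGITUDE", "results.LATITUDE"),
                  crs = 4326)              # lat/long from OneMap are WGS84
  st_transform(pts, crs = 3414)            # project to SVY21 / Singapore TM
}
```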
Under this section, we will:
check for missing values
check for the correct coordinate system
check for invalid geometries
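These three checks can be bundled into one helper (a sketch; the helper name is mine):

```r
library(sf)

check_sf <- function(x) {
  list(missing = colSums(is.na(st_drop_geometry(x))),  # NAs per column
       epsg    = st_crs(x)$epsg,                       # expect 3414
       invalid = sum(!st_is_valid(x)))                 # expect 0
}
```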
We can observe that we have no missing values.
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Our dataset has the correct Singapore coordinate system, EPSG:3414.
Based on the output, our geometries are valid.
Once again, I saved the processed resale_sf in an rds file.
The code chunk below retrieves the Master Plan 2014 subzone boundary polygon data for Singapore.
Reading layer `MP14_SUBZONE_WEB_PL' from data source
`C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial\MP14_SUBZONE_WEB_PL.kml'
using driver `KML'
Simple feature collection with 323 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XYZ
Bounding box: xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
For some of the locational factors, I will be utilising onemapAPI to retrieve its geometry data.
One method I discovered was using a token to retrieve the geometry of locational factors based on their related theme. But before we can proceed, we need to register for an account here and retrieve a token.
The code chunk below helps us view the available themes. As a token is needed, I first saved the output in an rds file.
Reading that file back, I sorted the themes in alphabetical order for easier reference.
# A tibble: 193 × 5
THEMENAME QUERYNAME ICON CATEG…¹ THEME…²
<chr> <chr> <chr> <chr> <chr>
1 ABC Waters Completed abcwater… abcl… Enviro… PUBLIC…
2 ABC Waters Construction abcwater… tend… Enviro… PUBLIC…
3 ABC Waters Sites abcwater… abcl… Enviro… PUBLIC…
4 Active Cemeteries activece… circ… Enviro… NATION…
5 Aedes Mosquito Breeding Habitats - Central breeding… Pict… Health NATION…
6 Aedes Mosquito Breeding Habitats - North East breeding… Pict… Health NATION…
7 Aedes Mosquito Breeding Habitats - North West breeding… Pict… Health NATION…
8 Aedes Mosquito Breeding Habitats - South East breeding… Pict… Health NATION…
9 Aedes Mosquito Breeding Habitats - South West breeding… Pict… Health NATION…
10 After Death Facilities afterdea… coff… Enviro… NATION…
# … with 183 more rows, and abbreviated variable names ¹CATEGORY, ²THEME_OWNER
Looking through the available themes from the OneMap API, the themes relevant to our tasks are:
| Theme Name | Query Name |
|---|---|
| Child Care Services | childcare |
| Eldercare Services | eldercare |
| Hawker Centres_New (Note: there are other similar themes such as Hawker Centres and Healthier Hawker Centres) | hawkercentre_new |
| Kindergartens | kindergartens |
| Parks (Note: there are other similar themes such as Parks@SG and NParks Parks and Nature Reserves) | nationalparks |
For the following code chunks, I created a shapefile for each locational factor I am interested in.
The steps taken are:
Retrieve the geometry and name of each place/amenity using onemap’s get_theme() function, which takes in a theme (i.e. its query name), and store the data in a df
Convert the df to a sf object by using the st_as_sf() function
Transform crs information to Singapore coordinates system i.e 3414 by using st_transform() function
Write to a shapefile using st_write() function
#| eval: false
# theme: childcare
# retrieve the geometry and name for the theme
childcare_tibble <- get_theme(token, "childcare")
# convert the data frame of coordinates to an sf object, transform the crs
# and write it out as a shapefile
childcare_sf <- st_as_sf(childcare_tibble, coords = c("Lng", "Lat"),
                         crs = 4326) %>%
  st_transform(crs = 3414)
st_write(obj = childcare_sf,
         dsn = "data/geospatial",
         layer = "childcare",
         driver = "ESRI Shapefile",
         append = FALSE)
#| eval: false
# theme: eldercare
eldercare_tibble <- get_theme(token, "eldercare")
eldercare_sf <- st_as_sf(eldercare_tibble, coords = c("Lng", "Lat"),
                         crs = 4326) %>%
  st_transform(crs = 3414)
write_rds(eldercare_sf, "data/rds/eldercare.rds")
#| eval: false
# theme: new hawker centres
hawkercentre_new_tibble <- get_theme(token, "hawkercentre_new")
hawkercentre_new_sf <- st_as_sf(hawkercentre_new_tibble, coords = c("Lng", "Lat"),
                                crs = 4326) %>%
  st_transform(crs = 3414)
write_rds(hawkercentre_new_sf, "data/rds/hawker_new.rds")
# issues writing in this manner:
# st_write(obj = hawkercentre_new_sf,
#          dsn = "data/geospatial",
#          layer = "hawkercentre_new",
#          driver = "ESRI Shapefile",
#          append = FALSE)
#| eval: false
# theme: kindergartens
kindergartens_tibble <- get_theme(token, "kindergartens")
kindergartens_sf <- st_as_sf(kindergartens_tibble, coords = c("Lng", "Lat"),
                             crs = 4326) %>%
  st_transform(crs = 3414)
write_rds(kindergartens_sf, "data/rds/kindergartens.rds")
#| eval: false
# theme: parks
nationalparks_tibble <- get_theme(token, "nationalparks")
nationalparks_sf <- st_as_sf(nationalparks_tibble, coords = c("Lng", "Lat"),
                             crs = 4326) %>%
  st_transform(crs = 3414)
write_rds(nationalparks_sf, "data/rds/nationalparks.rds")
Import the following to be used for our tasks later.
Reading layer `childcare' from data source
`C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 1925 features and 5 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 11810.03 ymin: 25596.33 xmax: 45404.24 ymax: 49300.88
Projected CRS: SVY21 / Singapore TM
For this assignment, I consider the CBD area to be the Downtown Core, so I will be using the coordinates of the Downtown Core.
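A sketch of such a reference point (the coordinates below are an assumed approximation of the Downtown Core centre, not taken from the source):

```r
library(sf)

# Build a single CBD reference point and project it to EPSG:3414,
# so distances to it can be computed in metres
cbd_point <- function(lat = 1.2879, lng = 103.8518) {
  st_transform(st_sfc(st_point(c(lng, lat)), crs = 4326), crs = 3414)
}
```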
Reading layer `supermarkets-geojson' from data source
`C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial\supermarkets-geojson.geojson'
using driver `GeoJSON'
Simple feature collection with 526 features and 2 fields
Geometry type: POINT
Dimension: XYZ
Bounding box: xmin: 103.6258 ymin: 1.24715 xmax: 104.0036 ymax: 1.461526
z_range: zmin: 0 zmax: 0
Geodetic CRS: WGS 84
Reading layer `BusStop' from data source
`C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5159 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
Reading layer `Train_Station_Exit_Layer' from data source
`C:\RhondaHO\IS415-GAA\Take-home_Assgn\Take-home_Assgn3\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 562 features and 2 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 6134.086 ymin: 27499.7 xmax: 45356.36 ymax: 47865.92
Projected CRS: SVY21
While searching for a dataset of primary school locations, I came across this dataset from data.gov.sg, which has general information on schools in Singapore. By filtering the mainlevel_code column (which represents the school type) for primary schools, I can extract the address and postal codes of primary schools in Singapore.
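A sketch of that filter (the mainlevel_code value "PRIMARY" is assumed from the data.gov.sg file; the helper name is mine):

```r
library(dplyr)

# Keep primary schools only, and just the columns we need
primary_only <- function(schools) {
  schools %>%
    filter(mainlevel_code == "PRIMARY") %>%
    select(school_name, address, postal_code)
}
```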
Rows: 183
Columns: 3
$ school_name <chr> "ADMIRALTY PRIMARY SCHOOL", "AHMAD IBRAHIM PRIMARY SCHOOL"…
$ address <chr> "11 WOODLANDS CIRCLE", "10 YISHUN STREET 11", "100 Br…
$ postal_code <chr> "738907", "768643", "579646", "159016", "544969", "569785"…
Based on the output, we can observe that there are 183 primary schools in Singapore. However, in 2022 Juying Primary School was merged with Pioneer Primary School, so it cannot be found via the API. Thus, I decided to remove it from our data.
Once again, we can use the geocode() function we created earlier to extract the geometry of each primary school by its school name.
To reuse the primary school data without recalling the API, I saved it in an rds file.
Next, we can convert our df into an sf object.
However, besides this, we need to determine what makes a good primary school (which is necessary for one of our tasks). On the assumption that a good primary school is a popular one, I picked out the top 10 most popular primary schools, referencing their popularity from this website.
popular_primary_schools <- c("Pei Hwa Presbyterian Primary School",
                             "Gongshang Primary School",
                             "Riverside Primary School",
                             "Red Swastika School",
                             "Punggol Green Primary School",
                             "Princess Elizabeth Primary School",
                             "Westwood Primary School",
                             "St. Hilda’s Primary School",
                             "Catholic High School (Primary Section)",
                             "Ai Tong School")
popular_primary_schools
 [1] "Pei Hwa Presbyterian Primary School"
[2] "Gongshang Primary School"
[3] "Riverside Primary School"
[4] "Red Swastika School"
[5] "Punggol Green Primary School"
[6] "Princess Elizabeth Primary School"
[7] "Westwood Primary School"
[8] "St. Hilda’s Primary School"
[9] "Catholic High School (Primary Section)"
[10] "Ai Tong School"
Next, I need to check whether the top 10 most popular primary schools can be found in the primary school data I extracted earlier. But before that, for consistency, I used the lapply() function to convert the school names picked from the website to uppercase.
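A sketch of that check (toupper() used here in place of the lapply() call; the helper name is mine):

```r
# Uppercase the popular-school names, then split them into the ones
# found in the extracted primary school list and the ones missing
check_popular <- function(popular, extracted) {
  pop <- toupper(popular)
  list(found   = pop[pop %in% extracted],
       missing = pop[!pop %in% extracted])
}
```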
Based on the output above, out of the 10 primary schools, only 8 can be found. The code chunk below tells us which primary schools are missing.
[[1]]
[1] "ST. HILDA’S PRIMARY SCHOOL"
[[2]]
[1] "CATHOLIC HIGH SCHOOL (PRIMARY SECTION)"
Looking further into our dataset, I found that the school names in the output above are written differently in the original primary school dataset. For example, CANOSSA CATHOLIC PRIMARY SCHOOL corresponds to CATHOLIC HIGH SCHOOL (PRIMARY SECTION). Thus, I decided to use the rbind() function to manually add both records to popular_primary_schools_sf.
For shopping malls, I used the dataset extracted by Valery Lim, who web-scraped the shopping mall data from Wikipedia in 2019.
Rows: 184
Columns: 3
$ name <chr> "100 AM", "112 KATONG", "313@SOMERSET", "321 CLEMENTI", "600…
$ latitude <dbl> 1.274588, 1.305087, 1.301385, 1.312025, 1.334042, 1.437131, …
$ longitude <dbl> 103.8435, 103.9051, 103.8377, 103.7650, 103.8510, 103.7953, …
From the glimpse of our data, there were a total of 184 shopping malls in 2019.
Next, the code chunk below transforms our data to the correct Singapore coordinate system.
Same as before, we will conduct the following data pre-processing steps, this time with an additional step of removing irrelevant columns for certain datasets:
removing irrelevant columns
checking for missing values
checking for the correct coordinate system
checking for invalid geometries
For our datasets, we only need the name and the geometry, so in the following code chunks I will drop the irrelevant columns.
# mpsz
mpsz <- mpsz %>% select("Name")

# childcare_sf
# combine name and postal address to make each row unique;
# some childcare centres share a name but have different locations
childcare_sf$full_address <- paste(childcare_sf$NAME, childcare_sf$ADDRESSP)
childcare_sf <- childcare_sf %>% select("full_address")

# eldercare_sf
eldercare_sf$full_address <- paste(eldercare_sf$NAME, eldercare_sf$ADDRESSPOSTALCODE)
eldercare_sf <- eldercare_sf %>% select("full_address")

# hawkercentre_new_sf
hawkercentre_new_sf <- hawkercentre_new_sf %>% select("NAME")

# kindergartens_sf
kindergartens_sf <- kindergartens_sf %>% select("NAME")

# nationalparks_sf
nationalparks_sf <- nationalparks_sf %>% select("NAME")

# supermarket
supermarket_sf <- supermarket_sf %>% select("Description")

# bus stop
bus_stop_sf$stop_name <- paste(bus_stop_sf$BUS_STOP_N, bus_stop_sf$BUS_ROOF_N, bus_stop_sf$LOC_DESC)
bus_stop_sf <- bus_stop_sf %>% select("stop_name")

# mrt
# combine station name and exit to make each row unique
mrt_sf$stn <- paste(mrt_sf$stn_name, mrt_sf$exit_code)
mrt_sf <- mrt_sf %>% select("stn")
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
Based on the output, there are no missing values.
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Coordinate Reference System:
User input: SVY21 / Singapore TM
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Coordinate Reference System:
User input: EPSG:3414
wkt:
PROJCRS["SVY21 / Singapore TM",
BASEGEOGCRS["SVY21",
DATUM["SVY21",
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4757]],
CONVERSION["Singapore Transverse Mercator",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",1.36666666666667,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",103.833333333333,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",1,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",28001.642,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",38744.572,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["northing (N)",north,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["easting (E)",east,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Cadastre, engineering survey, topographic mapping."],
AREA["Singapore - onshore and offshore."],
BBOX[1.13,103.59,1.47,104.07]],
ID["EPSG",3414]]
Based on the output, our datasets are in the correct coordinate systems.
[1] 9
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
[1] 0
Based on the output, we can see that only the mpsz dataset has invalid geometries. The code chunk below addresses this issue.
The following code chunks show some visualisations of the data we have just preprocessed.
The code chunk below shows the location of:
childcare centres - blue dots
eldercare centres - red dots
kindergartens - orange dots
primary schools - black dots
top 10 popular primary schools - grey dots
tmap_mode("view")
tm_shape(childcare_sf) +
tm_dots(alpha=0.5, #affects transparency of points
col="#2ec4b6",
size=0.05) +
tm_shape(eldercare_sf) +
tm_dots(alpha=0.5,
col="#e71d36",
size=0.05) +
tm_shape(kindergartens_sf) +
tm_dots(alpha=0.5,
col="#ff9f1c",
size=0.05) +
tm_shape(primary_school_sf) +
tm_dots(alpha=0.5,
col="#011627",
size=0.05) +
tm_shape(popular_primary_schools_sf) +
tm_dots(alpha=0.5,
col="grey",
size=0.05) +
tm_view(set.zoom.limits = c(11, 14))
One thing we can observe is that there are noticeably more childcare centres than any of the other facility types.
The code chunk below shows the location of:
supermarkets - red dots
shopping malls - orange dots
national parks - dark green dots
hawker centres - blue dots
tmap_mode("view")
tm_shape(supermarket_sf) +
tm_dots(alpha=0.5, #affects transparency of points
col="#d62828",
size=0.05) +
tm_shape(shopping_mall_sf) +
tm_dots(alpha=0.5,
col="#f77f00",
size=0.05) +
tm_shape(nationalparks_sf) +
tm_dots(alpha=0.5,
col="#023020",
size=0.05) +
tm_shape(hawkercentre_new_sf) +
tm_dots(alpha=0.5,
col="#ADD8E6",
size=0.05) +
tm_view(set.zoom.limits = c(10, 14))
We can observe that from October 2022 to December 2022, the majority of the 5-room HDB flats can be found on the east side of Singapore, and the higher-priced flats tend to be in the south-eastern part of Singapore.
As per our task, we need to find the proximity to certain facilities, i.e. proximity to the CBD, eldercare centres, hawker centres, MRT stations, parks, good primary schools, shopping malls and supermarkets. These distances can be computed with a helper function, proximity(), which uses st_distance() to compute the distance to every facility and the rowMins() function of the matrixStats package to keep the shortest one. The values are then appended to resale_sf as new columns.
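The proximity() helper itself is hidden by the code-fold; a minimal sketch consistent with the description above (the function body is an assumption):

```r
proximity <- function(df, facility_sf, col_name) {
  # pairwise distances between every flat and every facility (metres, EPSG:3414)
  dist_matrix <- sf::st_distance(df, facility_sf) %>%
    units::drop_units()
  # keep only the distance to the nearest facility
  df[, col_name] <- matrixStats::rowMins(dist_matrix)
  df
}
```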
#CBD, eldercare, hawker centres, MRT, park, good primary school, shopping mall and supermarket.
resale_sf <-
proximity(resale_sf, cbd_sf, "PROX_CBD") %>%
proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
proximity(., hawkercentre_new_sf, "PROX_HAWKER") %>%
proximity(., mrt_sf, "PROX_MRT") %>%
proximity(., nationalparks_sf, "PROX_PARK") %>%
proximity(., popular_primary_schools_sf, "PROX_GDPRISCH") %>%
proximity(., shopping_mall_sf, "PROX_SHOPPINGMALL") %>%
proximity(., supermarket_sf, "PROX_SPRMKT")
As per our task, we also want to count the number of facilities within a particular radius. Thus, we create another helper function, num_radius(), which uses st_distance() to compute the distances between the flats and the desired facilities, and then counts the observations within the radius with rowSums(). The values will be appended to the data frame as new columns.
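As with proximity(), a minimal sketch of what num_radius() might look like (the function body is an assumption):

```r
num_radius <- function(df, facility_sf, col_name, radius) {
  # pairwise distances between every flat and every facility (metres)
  dist_matrix <- sf::st_distance(df, facility_sf) %>%
    units::drop_units()
  # count the facilities that fall within the given radius
  df[, col_name] <- rowSums(dist_matrix <= radius)
  df
}
```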
#Numbers of kindergartens within 350m, Numbers of childcare centres within 350m, Numbers of bus stop within 350m, Numbers of primary school within 1km
resale_sf <-
num_radius(resale_sf, kindergartens_sf, "NUM_KNDRGTN", 350) %>%
num_radius(., childcare_sf, "NUM_CHILDCARE", 350) %>%
num_radius(., bus_stop_sf, "NUM_BUS_STOP", 350) %>%
num_radius(., primary_school_sf, "NUM_PRISCH", 1000)
Once again, I would like to save the latest resale_sf file.
Rows: 2,491
Columns: 25
$ month <chr> "2022-10", "2022-10", "2022-10", "2022-10", "2022-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", …
$ block <chr> "306", "310A", "551", "551", "431", "501", "253", …
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 1", "ANG MO KI…
$ storey_range <chr> "10 TO 12", "13 TO 15", "01 TO 03", "16 TO 18", "2…
$ floor_area_sqm <dbl> 123, 119, 118, 118, 119, 121, 137, 110, 110, 110, …
$ flat_model <chr> "Standard", "Improved", "Improved", "Improved", "I…
$ lease_commence_date <dbl> 46, 11, 42, 42, 44, 42, 27, 22, 22, 22, 5, 5, 5, 2…
$ remaining_lease <chr> "53.83", "89", "57.33", "57.33", "55.42", "57.33",…
$ resale_price <dbl> 700000, 980000, 568000, 603000, 720000, 670000, 80…
$ storey_order <int> 4, 5, 1, 6, 9, 3, 1, 7, 1, 5, 2, 9, 7, 8, 7, 4, 3,…
$ geometry <POINT [m]> POINT (29383.53 38640.51), POINT (29171.38 3…
$ PROX_CBD <dbl> 8625.861, 8544.547, 9537.543, 9537.543, 8876.533, …
$ PROX_ELDERCARE <dbl> 211.9637, 143.8115, 1064.6617, 1064.6617, 356.4709…
$ PROX_HAWKER <dbl> 331.2628, 496.9721, 482.8156, 482.8156, 375.3304, …
$ PROX_MRT <dbl> 573.2417, 797.7411, 1080.8607, 1080.8607, 365.2200…
$ PROX_PARK <dbl> 554.3174, 593.1602, 735.9373, 735.9373, 390.1731, …
$ PROX_GDPRISCH <dbl> 1526.5641, 1292.1608, 3212.9380, 3212.9380, 2376.2…
$ PROX_SHOPPINGMALL <dbl> 490.9497, 708.9643, 1213.2871, 1213.2871, 514.1228…
$ PROX_SPRMKT <dbl> 246.5007, 313.5439, 419.9139, 419.9139, 313.5926, …
$ NUM_KNDRGTN <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,…
$ NUM_CHILDCARE <dbl> 6, 2, 3, 3, 4, 1, 3, 4, 4, 3, 4, 4, 4, 4, 6, 5, 8,…
$ NUM_BUS_STOP <dbl> 6, 6, 2, 2, 6, 9, 11, 6, 6, 6, 6, 8, 8, 7, 7, 12, …
$ NUM_PRISCH <dbl> 2, 2, 1, 1, 3, 1, 3, 3, 3, 3, 2, 3, 3, 2, 2, 4, 4,…
As earlier mentioned, for the resale dataset,
Training dataset period: October 2022 to December 2022
Test dataset period: January 2023 to February 2023
Hence, the code chunk below reads the rds file - resale_sf which we created earlier on and splits the dataset accordingly.
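A sketch of that split, based on the periods stated above (the rds path is an assumption):

```r
resale_sf <- read_rds("data/model/resale_sf.rds")
# "YYYY-MM" strings sort lexicographically, so range filters work directly
train_data <- filter(resale_sf, month >= "2022-10" & month <= "2022-12")
test_data  <- filter(resale_sf, month >= "2023-01" & month <= "2023-02")
```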
For future reference, I will write the train and test datasets of resale_sf as rds files.
The code chunk below reads our train and test dataset of resale_sf.
Rows: 1,496
Columns: 25
$ month <chr> "2022-10", "2022-10", "2022-10", "2022-10", "2022-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", "5 ROOM", …
$ block <chr> "306", "310A", "551", "551", "431", "501", "253", …
$ street_name <chr> "ANG MO KIO AVE 1", "ANG MO KIO AVE 1", "ANG MO KI…
$ storey_range <chr> "10 TO 12", "13 TO 15", "01 TO 03", "16 TO 18", "2…
$ floor_area_sqm <dbl> 123, 119, 118, 118, 119, 121, 137, 110, 110, 110, …
$ flat_model <chr> "Standard", "Improved", "Improved", "Improved", "I…
$ lease_commence_date <dbl> 46, 11, 42, 42, 44, 42, 27, 22, 22, 22, 5, 5, 5, 2…
$ remaining_lease <chr> "53.83", "89", "57.33", "57.33", "55.42", "57.33",…
$ resale_price <dbl> 700000, 980000, 568000, 603000, 720000, 670000, 80…
$ storey_order <int> 4, 5, 1, 6, 9, 3, 1, 7, 1, 5, 2, 9, 7, 8, 7, 4, 3,…
$ geometry <POINT [m]> POINT (29383.53 38640.51), POINT (29171.38 3…
$ PROX_CBD <dbl> 8625.861, 8544.547, 9537.543, 9537.543, 8876.533, …
$ PROX_ELDERCARE <dbl> 211.9637, 143.8115, 1064.6617, 1064.6617, 356.4709…
$ PROX_HAWKER <dbl> 331.2628, 496.9721, 482.8156, 482.8156, 375.3304, …
$ PROX_MRT <dbl> 573.2417, 797.7411, 1080.8607, 1080.8607, 365.2200…
$ PROX_PARK <dbl> 554.3174, 593.1602, 735.9373, 735.9373, 390.1731, …
$ PROX_GDPRISCH <dbl> 1526.5641, 1292.1608, 3212.9380, 3212.9380, 2376.2…
$ PROX_SHOPPINGMALL <dbl> 490.9497, 708.9643, 1213.2871, 1213.2871, 514.1228…
$ PROX_SPRMKT <dbl> 246.5007, 313.5439, 419.9139, 419.9139, 313.5926, …
$ NUM_KNDRGTN <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,…
$ NUM_CHILDCARE <dbl> 6, 2, 3, 3, 4, 1, 3, 4, 4, 3, 4, 4, 4, 4, 6, 5, 8,…
$ NUM_BUS_STOP <dbl> 6, 6, 2, 2, 6, 9, 11, 6, 6, 6, 6, 8, 8, 7, 7, 12, …
$ NUM_PRISCH <dbl> 2, 2, 1, 1, 3, 1, 3, 3, 3, 3, 2, 3, 3, 2, 2, 4, 4,…
Having a glimpse at our training and test data, I realised that some columns are irrelevant for our subsequent tasks, such as flat_model, town, flat_type, block and street_name. Thus, I removed them.
The lease_commence_date column represents the age of the unit that we calculated earlier on, but its name can be misleading, so I renamed it to age.
Looking at our data again, age and remaining_lease are stored as character, but they should be numeric, so I converted them accordingly.
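A sketch of that conversion, applied to both splits:

```r
# convert the character columns to numeric in train and test data
train_data <- train_data %>%
  mutate(across(c(age, remaining_lease), as.numeric))
test_data <- test_data %>%
  mutate(across(c(age, remaining_lease), as.numeric))
```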
Finally, we are done processing our train and test data for our subsequent tasks!
The code chunk below shows a summary of our resale prices.
Before loading the predictors into a predictive model, it is good practice to use a correlation matrix to check for signs of multicollinearity.
But before that, we need to drop our geometry values.
The code chunk below saves the rds file for training data without geometry so we can use it later.
The code chunk below computes the correlation matrix.
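A sketch of such a chunk (the corrplot package and the numeric-column selection are assumptions; any correlation-plot function would do):

```r
# correlation matrix of the numeric predictors only
corrplot::corrplot(cor(train_nogeo %>% select(where(is.numeric))),
                   method = "number",   # print the coefficients
                   type = "upper",      # upper triangle only
                   tl.cex = 0.6,
                   number.cex = 0.4)
```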

In general, an absolute correlation coefficient of >0.7 between two or more predictors indicates the presence of multicollinearity. As we have a lot of variables, it is quite hard to read the correlation values, but most seem to be fine, except that the correlation between age and remaining_lease is -1. This indicates that the age of the unit and the remaining lease are perfectly multicollinear, so we should remove either age or remaining_lease before fitting our linear regression model.
The code chunk below removes the age of unit from our dataset.
Another way to check for multicollinearity is to look at the Variance Inflation Factor (VIF), which we will do later.
First, let's build our multiple linear regression model, where y is resale_price and x are the independent predictors.
Call:
lm(formula = resale_price ~ ., data = train_nogeo)
Residuals:
Min 1Q Median 3Q Max
-236572 -46059 -4003 39815 418533
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.524e+05 5.764e+04 -4.379 1.28e-05 ***
floor_area_sqm 6.240e+03 3.570e+02 17.477 < 2e-16 ***
remaining_lease 5.952e+03 2.337e+02 25.473 < 2e-16 ***
storey_order 2.131e+04 1.029e+03 20.697 < 2e-16 ***
PROX_CBD -2.227e+01 6.755e-01 -32.962 < 2e-16 ***
PROX_ELDERCARE 1.160e+01 3.172e+00 3.658 0.000263 ***
PROX_HAWKER -2.465e+01 3.581e+00 -6.883 8.61e-12 ***
PROX_MRT -2.826e+01 5.621e+00 -5.027 5.59e-07 ***
PROX_PARK 1.900e+01 4.351e+00 4.367 1.35e-05 ***
PROX_GDPRISCH -4.721e-01 1.276e+00 -0.370 0.711487
PROX_SHOPPINGMALL -4.620e+00 6.324e+00 -0.731 0.465180
PROX_SPRMKT 1.459e+01 1.373e+01 1.062 0.288234
NUM_KNDRGTN 1.154e+04 2.038e+03 5.663 1.78e-08 ***
NUM_CHILDCARE -4.383e+03 1.040e+03 -4.216 2.63e-05 ***
NUM_BUS_STOP 5.283e+02 6.902e+02 0.765 0.444202
NUM_PRISCH -9.436e+03 1.412e+03 -6.681 3.36e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 75210 on 1480 degrees of freedom
Multiple R-squared: 0.7083, Adjusted R-squared: 0.7054
F-statistic: 239.6 on 15 and 1480 DF, p-value: < 2.2e-16
Based on the output, the adjusted R-squared is 0.7054, which indicates that roughly 70% of the variance of the dependent variable is explained by the independent variables. Most of the predictors are significant (p-value < 0.05), except for PROX_SPRMKT, PROX_SHOPPINGMALL, PROX_GDPRISCH and NUM_BUS_STOP.
We can use the ols_vif_tol() function of the olsrr package to calculate the VIF. In general, a VIF below 5 indicates no serious multicollinearity among the predictors.
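Assuming the fitted model object is named mlr1 (as referenced in the next section), the table below can be produced with:

```r
ols_vif_tol(mlr1)
```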
Variables Tolerance VIF
1 floor_area_sqm 0.5490904 1.821194
2 remaining_lease 0.4939016 2.024695
3 storey_order 0.8716868 1.147201
4 PROX_CBD 0.4984548 2.006200
5 PROX_ELDERCARE 0.7790581 1.283601
6 PROX_HAWKER 0.7879037 1.269191
7 PROX_MRT 0.8119679 1.231576
8 PROX_PARK 0.8281714 1.207480
9 PROX_GDPRISCH 0.7162116 1.396235
10 PROX_SHOPPINGMALL 0.7027136 1.423055
11 PROX_SPRMKT 0.8063695 1.240126
12 NUM_KNDRGTN 0.7460480 1.340396
13 NUM_CHILDCARE 0.6323711 1.581350
14 NUM_BUS_STOP 0.8319573 1.201985
15 NUM_PRISCH 0.6071676 1.646992
The higher the VIF, the less reliable the regression results. Based on the output, there are no signs of multicollinearity, as all VIF values are below 5.
The code chunk below saves the results of our model to an rds file.
In this section, we will calibrate our mlr1 model to predict HDB resale prices.
class : SpatialPointsDataFrame
features : 1496
extent : 6958.193, 42645.18, 28157.26, 48741.06 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
variables : 16
names : floor_area_sqm, remaining_lease, resale_price, storey_order, PROX_CBD, PROX_ELDERCARE, PROX_HAWKER, PROX_MRT, PROX_PARK, PROX_GDPRISCH, PROX_SHOPPINGMALL, PROX_SPRMKT, NUM_KNDRGTN, NUM_CHILDCARE, NUM_BUS_STOP, ...
min values : 104, 49.75, 415000, 1, 1610.9563636452, 0.00121769219557756, 49.4878683203785, 27.3898687165439, 60.0396043524142, 115.665591934146, 0.00118710679792838, 0.00109267327868705, 0, 0, 0, ...
max values : 149, 95.33, 1345000, 14, 23277.3731998479, 8265.9709141931, 6633.82585101601, 2070.17690667209, 6003.01683926018, 9048.73399591188, 4009.20093399522, 1451.97426427893, 7, 19, 19, ...
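The bandwidth search was run with bw.gwr() of the GWmodel package; a sketch of that call, with arguments chosen to match the result reported below:

```r
bw_adaptive <- bw.gwr(resale_price ~ .,
                      data = train_data_sp,
                      approach = "CV",      # cross-validation
                      kernel = "gaussian",
                      adaptive = TRUE,      # bandwidth as number of neighbours
                      longlat = FALSE)
```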
Using cross-validation with a Gaussian kernel, the smallest CV score is 3.054284e+12, attained at an adaptive bandwidth of 22 nearest neighbours.
Next, we will save the adaptive bandwidth as an rds file.
First, let's load the saved bandwidth by using the code chunk below.
Now, we can go ahead to calibrate the gwr-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel as shown in the code chunk below.
The code chunk below will be used to save the model in rds format for future use.
The code below can be used to display the model output.
***********************************************************************
* Package GWmodel *
***********************************************************************
Program starts at: 2023-03-26 19:54:55
Call:
gwr.basic(formula = resale_price ~ ., data = train_data_sp, bw = bw_adaptive,
kernel = "gaussian", adaptive = TRUE, longlat = FALSE)
Dependent (y) variable: resale_price
Independent variables: .
Number of data points: 1496
***********************************************************************
* Results of Global Regression *
***********************************************************************
Call:
lm(formula = formula, data = data)
Residuals:
Min 1Q Median 3Q Max
-232352 -45529 -4238 37272 385542
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.973e+05 5.911e+04 -3.338 0.000863 ***
floor_area_sqm 6.251e+03 3.568e+02 17.518 < 2e-16 ***
remaining_lease 6.167e+03 2.397e+02 25.730 < 2e-16 ***
storey_order 2.088e+04 1.038e+03 20.112 < 2e-16 ***
PROX_CBD -1.891e+01 1.077e+00 -17.560 < 2e-16 ***
PROX_ELDERCARE 1.140e+01 3.175e+00 3.589 0.000342 ***
PROX_HAWKER -2.752e+01 3.720e+00 -7.398 2.30e-13 ***
PROX_MRT -3.121e+01 5.644e+00 -5.530 3.79e-08 ***
PROX_PARK 2.249e+01 4.435e+00 5.070 4.48e-07 ***
PROX_GDPRISCH 1.427e+00 1.433e+00 0.996 0.319346
PROX_SHOPPINGMALL -6.769e+00 6.302e+00 -1.074 0.282938
PROX_SPRMKT 1.095e+01 1.375e+01 0.797 0.425714
NUM_KNDRGTN 1.036e+04 2.050e+03 5.054 4.86e-07 ***
NUM_CHILDCARE -4.090e+03 1.036e+03 -3.950 8.20e-05 ***
NUM_BUS_STOP 1.515e+02 6.918e+02 0.219 0.826681
NUM_PRISCH -8.990e+03 1.447e+03 -6.213 6.75e-10 ***
coords.x1 7.400e-01 3.382e-01 2.188 0.028814 *
coords.x2 -3.416e+00 7.531e-01 -4.536 6.20e-06 ***
---Significance stars
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 74740 on 1478 degrees of freedom
Multiple R-squared: 0.7123
Adjusted R-squared: 0.709
F-statistic: 215.3 on 17 and 1478 DF, p-value: < 2.2e-16
***Extra Diagnostic information
Residual sum of squares: 8.257033e+12
Sigma(hat): 74342.42
AIC: 37841.04
AICc: 37841.56
BIC: 36584.84
***********************************************************************
* Results of Geographically Weighted Regression *
***********************************************************************
*********************Model calibration information*********************
Kernel function: gaussian
Adaptive bandwidth: 22 (number of nearest neighbours)
Regression points: the same locations as observations are used.
Distance metric: Euclidean distance metric is used.
****************Summary of GWR coefficient estimates:******************
Min. 1st Qu. Median 3rd Qu. Max.
Intercept -8.5125e+08 -9.6843e+06 1.5185e+06 1.3389e+07 5.3217e+09
floor_area_sqm -4.0120e+03 2.5913e+03 3.9227e+03 5.9585e+03 1.5536e+06
remaining_lease -9.7555e+04 5.4572e+03 7.2246e+03 9.0025e+03 4.9399e+04
storey_order 6.3018e+03 1.2182e+04 1.5164e+04 1.8695e+04 2.5594e+04
PROX_CBD -2.9220e+04 -5.6422e+02 -2.5133e+00 4.8308e+02 2.3844e+05
PROX_ELDERCARE -4.8216e+04 -1.2686e+01 1.9113e+01 6.4787e+01 1.2264e+03
PROX_HAWKER -1.3614e+03 -4.4118e+01 -6.8652e+00 3.3078e+01 1.1669e+04
PROX_MRT -1.4344e+04 -9.4417e+01 -5.0592e+01 -4.7851e+00 8.5307e+02
PROX_PARK -6.4812e+02 -4.2881e+01 -9.2976e+00 2.3041e+01 1.5562e+03
PROX_GDPRISCH -3.9664e+03 -3.6373e+01 5.5693e+00 5.4074e+01 8.6887e+03
PROX_SHOPPINGMALL -7.0975e+02 -6.6587e+01 -2.0800e+01 2.1314e+01 1.1766e+04
PROX_SPRMKT -1.0656e+03 -3.0144e+01 5.7481e+00 4.8561e+01 2.5178e+03
NUM_KNDRGTN -5.2077e+04 -3.3102e+03 3.0430e+03 1.2316e+04 1.7806e+05
NUM_CHILDCARE -5.1618e+04 -3.3261e+03 1.7252e+02 2.4240e+03 2.2769e+04
NUM_BUS_STOP -4.4570e+04 -8.6265e+02 1.3844e+03 3.6228e+03 2.4699e+04
NUM_PRISCH -6.5807e+04 -5.8145e+03 2.0226e+03 8.5195e+03 2.8605e+05
coords.x1 -2.3903e+04 -3.5479e+02 -5.3459e+01 1.5550e+02 1.3272e+04
coords.x2 -1.8906e+05 -3.2348e+02 -4.1838e+00 4.0785e+02 2.9058e+04
************************Diagnostic information*************************
Number of data points: 1496
Effective number of parameters (2trace(S) - trace(S'S)): 552.7639
Effective degrees of freedom (n-2trace(S) + trace(S'S)): 943.2361
AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 36424.99
AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 35569.1
BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 36939.87
Residual sum of squares: 1.369059e+12
R-square value: 0.9523041
Adjusted R-square value: 0.9243233
***********************************************************************
Program stops at: 2023-03-26 19:54:57
From the output above, we can observe that the GWR adjusted R-squared is 0.9243233, which is higher than the non-spatial multiple linear regression model's adjusted R-squared of 0.709. Since R-squared measures a regression model's goodness of fit, the higher value suggests that the GWR model fits the data better.
The GWR model also has a lower AIC (35569.1) than the global regression model's AIC (37841.04). Lower AIC scores are better, as AIC rewards goodness of fit while penalising models that use more parameters.
The code chunk below extracts the x,y coordinates of the full, training and test datasets.
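A sketch of that extraction with st_coordinates() of sf (the name coords for the full set is an assumption; coords_train and coords_test are used later):

```r
coords       <- st_coordinates(resale_sf)
coords_train <- st_coordinates(train_data)
coords_test  <- st_coordinates(test_data)
```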
Before continuing, we write all the outputs to rds files for future use.
Now, we need to retrieve our training data, which does not have the geometry.
Now, we will calibrate a model to predict HDB resale prices by using the random forest function of the ranger package. set.seed() is used to keep the results reproducible, as random forest involves random sampling. Random forest uses bootstrap sampling of rows and random sampling of features, so each tree sees a different subset of data points and predictors; as a result, it is relatively robust to multicollinearity. This means that I do not need to remove the age variable.
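A minimal sketch of the calibration, matching the call echoed in the output that follows (the object name rf is an assumption):

```r
set.seed(1234)
rf <- ranger(resale_price ~ .,
             data = train_nogeo_rds,
             num.trees = 30)
rf
```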
Ranger result
Call:
ranger(resale_price ~ ., data = train_nogeo_rds, num.trees = 30)
Type: Regression
Number of trees: 30
Sample size: 1496
Number of independent variables: 16
Mtry: 4
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 3246290896
R squared (OOB): 0.830922
Based on the output, for a random forest of 30 trees, the OOB prediction error (MSE) is 3246290896 and the R squared (OOB) / R2 is 0.830922 which implies that around 83% of the variance of the train data is explained by the model.
Random forest rmse is 56976.23.
Next, we would like to calibrate the geographically weighted random forest by using the grf.bw() function. As grf.bw() is computationally expensive, I will run it on a subset of the data (500 rows). Even though the bandwidth estimated on 500 rows may differ substantially from one estimated on the whole dataset, approximating it on a subset is good enough as a proof of concept for this assignment.
gwRF_bw <- grf.bw(formula = resale_price ~
age +
remaining_lease +
floor_area_sqm +
storey_order +
PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT +
PROX_SHOPPINGMALL +
NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH,
data = train_nogeo_rds[0:500,],
kernel = "adaptive",
trees = 30,
coords = coords_train[0:500,]
)
As the computation time is too long, I will use what has been generated so far. Based on the output, the best bandwidth is 85, with an R2 of 0.761592272965718.
Next, I used the bandwidth discovered previously to calibrate our GWRF model. Afterwards, I saved it into an rds file.
set.seed(1234)
gwRF_adaptive <- grf(resale_price ~
age +
remaining_lease +
floor_area_sqm + storey_order +
PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT +
PROX_SHOPPINGMALL +
NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH,
dframe=train_nogeo_rds,
bw= 85,
ntree = 30,
kernel="adaptive",
coords=coords_train)
write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")
Rows: 1,496
Columns: 16
$ age <dbl> 320942757446, 448629756899, 647513499118, 7516663368…
$ remaining_lease <dbl> 657786726120, 422099032262, 492218074216, 5226643315…
$ floor_area_sqm <dbl> 41006992058, 30519375720, 105074950776, 185876407538…
$ storey_order <dbl> 110960855171, 104047373638, 101896718991, 8783605220…
$ PROX_CBD <dbl> 140999708606, 102437070763, 204273123777, 9483130917…
$ PROX_ELDERCARE <dbl> 73215701479, 53718852857, 351150757572, 246294051248…
$ PROX_HAWKER <dbl> 105058129002, 155231371012, 239672497719, 1388072614…
$ PROX_MRT <dbl> 103492994640, 59711928524, 177245309132, 14323796576…
$ PROX_PARK <dbl> 37696197821, 42225183408, 69473008268, 80618167170, …
$ PROX_GDPRISCH <dbl> 96507080949, 102504561673, 179261581918, 25158876663…
$ PROX_SPRMKT <dbl> 37300420254, 242577557255, 23842728322, 45316038072,…
$ PROX_SHOPPINGMALL <dbl> 57409486399, 47797347471, 29974885175, 38763840047, …
$ NUM_KNDRGTN <dbl> 53090623507, 55873222840, 10019759988, 12115004309, …
$ NUM_CHILDCARE <dbl> 10102173169, 18880970827, 6813724611, 17966864708, 9…
$ NUM_BUS_STOP <dbl> 28784760335, 33066494085, 7765459152, 10159718996, 1…
$ NUM_PRISCH <dbl> 95700047514, 13935912388, 5150098017, 12549017005, 6…
Ranger result
Call:
ranger(resale_price ~ age + remaining_lease + floor_area_sqm + storey_order + PROX_CBD + PROX_ELDERCARE + PROX_HAWKER + PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT + PROX_SHOPPINGMALL + NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH, data = train_nogeo_rds, num.trees = 30, mtry = 5, importance = "impurity", num.threads = NULL)
Type: Regression
Number of trees: 30
Sample size: 1496
Number of independent variables: 16
Mtry: 5
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 3204428693
R squared (OOB): 0.8331023
The output shows that the OOB R-squared is 0.8331023 for a GWRF model of 30 trees and a bandwidth of 85.
The code chunk below uses the predict.lm() function of the stats package to run inference on our test data and saves the result in rds format.
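A sketch of that chunk (the object name ols_pred is an assumption; the save path matches the file read back later):

```r
ols_pred <- predict.lm(mlr1, test_data)
write_rds(ols_pred, "data/model/ols_pred.rds")
```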
The code chunk below uses the predict() function of the ranger package to run inference on our test data and saves the result in rds format.
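A sketch of that chunk (rf and test_data_nogeo — the test set without its geometry column — are assumed names; the save path matches the file read back later):

```r
rf_pred <- predict(rf, data = test_data_nogeo)
write_rds(rf_pred, "data/model/rf_pred.rds")
```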
The code chunk below will be used to combine the test data with its corresponding coordinates data.
Next, predict.grf() of the SpatialML package will be used to predict resale values by using the test data and the gwRF_adaptive model calibrated earlier.
Before moving on, let us save the output into rds file for future use.
The output of the predict.grf() is a vector of predicted values. It is wiser to convert it into a data frame for further visualisation and analysis.
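A sketch covering these three steps (test_data_pred — the test predictors bound with their X/Y coordinates — and the save path are assumptions):

```r
GRF_pred <- predict.grf(gwRF_adaptive,
                        test_data_pred,
                        x.var.name = "X",
                        y.var.name = "Y",
                        local.w = 1,    # weight given to the local model
                        global.w = 0)   # weight given to the global model
write_rds(GRF_pred, "data/model/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)
```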
In the code chunk below, cbind() is used to append the predicted values of both the GRF and multiple linear regression models onto test_data.
To evaluate the 3 models, we can combine the resale price and predicted prices of each model into a single dataframe.
ols_pred_df <- read_rds("data/model/ols_pred.rds") %>%
as.data.frame()
colnames(ols_pred_df) <- c("ols_pred")
rf_pred_df <- read_rds("data/model/rf_pred.rds")$predictions %>%
as.data.frame()
colnames(rf_pred_df) <- c("rf_pred")
gwRF_pred_df <- GRF_pred %>%
as.data.frame()
colnames(gwRF_pred_df) <- c("gwrf_pred")
prices_pred_df <- cbind(test_data$resale_price, ols_pred_df, rf_pred_df,
gwRF_pred_df) %>%
rename("actual_price" = "test_data$resale_price")
head(prices_pred_df)
  actual_price ols_pred  rf_pred gwrf_pred
1 682888 728585.5 769835.8 728135.3
2 695000 643365.4 660018.7 698225.9
3 658888 686182.7 754647.1 689452.9
4 790000 789026.6 763765.2 856222.9
5 800000 704982.1 762599.3 782600.2
6 858000 831632.1 755024.8 827002.7
Looking at the results, the GWRF model's predictions appear closest to the actual prices, compared with the other models.
The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis.
In the code chunk below, rmse() of Metrics package is used to compute the RMSE for GRF model.
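A sketch of that computation, using the combined prediction data frame built earlier:

```r
rmse(prices_pred_df$actual_price, prices_pred_df$gwrf_pred)
```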
[1] 53185.72
Comparing the three models' RMSE, the GWRF model has the lowest RMSE of 53185.72, compared to the OLS model's 69388.4 and the random forest model's 55187.82. RMSE measures the average difference between a model's predicted values and the actual values, and thus estimates how accurately the model predicts the target. Hence, among the three models, the GWRF model is the best at predicting resale prices.
test_data_models <- cbind(test_data, prices_pred_df)
gwr_plot <- ggplot(data = test_data_models,
aes(x = gwrf_pred,
y = actual_price)) +
geom_point()+
geom_abline(col = "Red")
lm_plot <- ggplot(data = test_data_models,
aes(x = ols_pred,
y = actual_price)) +
geom_point()+
geom_abline(col = "Red")
rf_plot <- ggplot(data = test_data_models,
aes(x = rf_pred,
y = actual_price)) +
geom_point()+
geom_abline(col = "Red")
plot_grid(gwr_plot, lm_plot, rf_plot, labels = "AUTO")
Comparing the three graphs, the GWRF model's scatter lies closest to the diagonal, while for the OLS and random forest models most of the points on the right half sit above the red line (i.e. they under-predict the higher-priced flats). Thus, the GWRF model outperforms the OLS and random forest models, as its scatter points cluster more tightly around the diagonal line.
In conclusion, by comparing the three models' performance at predicting actual resale prices, through both RMSE and the scatterplots, we can determine that the GWRF model is the best at predicting actual resale prices. One reason it performs better is that GWR builds on ordinary least squares regression and adds a level of modelling sophistication by allowing the relationships between the independent and dependent variables to vary by locality.
To conclude, I would like to thank Prof. Kam for our IS415 Geospatial Analytics and Applications course materials & resources. I would also like to thank my seniors, Megan Sim and Nor Aisyah Binte Ajit as I have referenced their codes.
---
title: "Take-home Assignment 3: Predicting HDB Public Housing Resale Prices using Geographically Weighted Methods"
date: "9 March 2023"
date-modified: last-modified
author: "Rhonda Ho Kah Yee"
format:
html:
code-fold: true
code-tools: true
execute:
message: false
warning: false
title-block-banner: true
editor: visual
---
# Overview
Hello! This is Rhonda Ho's take-home Assignment 3 for IS415 module.
To view/hide all the code at once, please click on the "\</\> code" tab beside the title of this html document and select the option to view/hide the code.
The full details of this assignment can be found [here](https://is415-ay2022-23t2.netlify.app/th_ex3.html).
## Task
In this take-home exercise, I am tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the months of January and February 2023 in Singapore. The predictive models must be built by using the conventional OLS method and GWR methods. I am also required to compare the performance of the conventional OLS method against the geographically weighted methods.
## The Data
Below is the list of datasets I have collected.
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Type | Dataset | File Format |
+======================+=================================================================================================================+=============+
| Aspatial | [Resale Flat Prices](https://data.gov.sg/dataset/resale-flat-prices) | .csv |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial | [Master Plan 2014 Subzone Boundary (Web)](https://data.gov.sg/dataset/master-plan-2014-subzone-boundary-web) | .kml |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial | [Bus Stop Location](https://datamall.lta.gov.sg/content/datamall/en/search_datasets.html?searchText=bus%20stop) | .kml |
| | | |
| | | .shp |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial | [Train Station](https://datamall.lta.gov.sg/content/datamall/en/search_datasets.html?searchText=Train) | .kml |
| | | |
| | | .shp |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial | [School Directory and Information](https://data.gov.sg/dataset/school-directory-and-information) | .csv |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial-Extracted | Child Care Services | .shp |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial-Extracted | Eldercare Services | .rds |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial-Extracted | Hawker Centres | .rds |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial-Extracted | Kindergartens | .rds |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial-Extracted | Parks | .rds |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial | [Supermarket](https://data.gov.sg/dataset/supermarkets) | .geojson |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
| Geospatial | [Shopping Malls](https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper) | .csv |
+----------------------+-----------------------------------------------------------------------------------------------------------------+-------------+
# Getting Started
The following packages will be used:
- [**sf**](https://cran.r-project.org/web/packages/sf/index.html): used for importing, managing, and processing geospatial data
- [**tidyverse**](https://www.tidyverse.org/): a collection of packages for data science tasks
- [**tmap**](https://cran.r-project.org/web/packages/tmap/index.html): used for creating thematic maps, such as choropleth and bubble maps
- [**sfdep**](https://cran.r-project.org/web/packages/sfdep/index.html): used to create spatial weights matrix objects, global and local spatial autocorrelation statistics and related calculations (e.g. spatially lag attributes)
- [**onemapsgapi**](https://cran.r-project.org/web/packages/onemapsgapi/index.html): used to query Singapore-specific spatial data, alongside additional functionalities.
- [**units**](https://cran.r-project.org/web/packages/units/index.html): used for manipulating numeric vectors that have physical measurement units associated with them
- [**matrixStats**](https://cran.r-project.org/web/packages/matrixStats/index.html): a set of high-performing functions for operating on rows and columns of matrices
- [**jsonlite**](https://cran.r-project.org/web/packages/jsonlite/vignettes/json-aaquickstart.html): a JSON parser that can convert from JSON to the appropriate R data types
- [**olsrr**](https://cran.r-project.org/web/packages/olsrr/index.html): used for building least squares regression models
- [**httr**](https://cran.r-project.org/web/packages/httr/) : used to make API calls, such as a GET request
- [**ggpubr**](https://cran.r-project.org/web/packages/ggpubr/index.html): used for multivariate data visualisation & analysis
- [**GWmodel**](https://cran.r-project.org/web/packages/GWmodel/index.html): provides a collection of localised spatial statistical methods, such as summary statistics, principal components analysis, discriminant analysis and various forms of GW regression
- [**SpatialML**](https://cran.r-project.org/web/packages/SpatialML/index.html)**:** allows for geographically weighted random forest regression, including a function to find the optimal bandwidth
- [**ranger**](https://cran.r-project.org/web/packages/ranger/index.html)**:** a fast implementation of Random Forests, particularly suited for high dimensional data
- [**cowplot**](https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html): a simple add-on to ggplot and provides various features that help with creating publication-quality figures
- [**Metrics**](https://cran.r-project.org/web/packages/Metrics/Metrics.pdf)**:** allows us to calculate the RMSE of our models
```{r}
pacman::p_load(readxl, sf, tidyverse, units, matrixStats, olsrr, gdata, tmap, sfdep, jsonlite, onemapsgapi, rvest, ranger, SpatialML, GWmodel, httr, cowplot, Metrics)
```
# Importing Data and Wrangling
Before I can perform my tasks, I need to obtain the following appropriate independent variables from my datasets.
**Structural factors:**
1. Area of the unit
2. Floor level
3. Remaining lease
4. Age of the unit
**Locational factors**
1. Proximity to CBD
2. Proximity to eldercare
3. Proximity to foodcourt/hawker centres
4. Proximity to MRT
5. Proximity to park
6. Proximity to good primary school
7. Proximity to shopping mall
8. Proximity to supermarket
9. Numbers of kindergartens within 350m
10. Numbers of childcare centres within 350m
11. Numbers of bus stop within 350m
12. Numbers of primary school within 1km
## Aspatial Data
### Resale Flat Prices
```{r}
resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")
glimpse(resale)
```
Having a glimpse of our data, we can observe that this dataset has a total of 11 columns and 149,071 rows. The columns are month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease and resale_price.
For this assignment, the dataset will focus on:
- Transaction period: October 2022 to February 2023
- Training dataset period: October 2022 to December 2022
- Test dataset period: January 2023 to February 2023
- Type of flat: 5-room flats
The code chunk below filters our dataset accordingly.
```{r}
resale<- resale %>%
filter(flat_type == "5 ROOM") %>%
filter(month >= "2022-10" & month <= "2023-02")
```
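As a quick aside, the range filter on month works because "YYYY-MM" strings sort lexicographically in the same order as dates. A minimal base R illustration, with hypothetical month values:

```{r}
# "YYYY-MM" strings compare in chronological order, so a plain string
# range filter selects the intended transaction window
months <- c("2022-09", "2022-10", "2022-12", "2023-01", "2023-02")
in_window <- months[months >= "2022-10" & months <= "2023-02"]
in_window
```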
Based on my [senior's](https://is415-msty.netlify.app/posts/2021-10-25-take-home-exercise-3/?panelset1=extracted2&panelset2=base3&panelset3=base4&panelset=base&panelset4=sg) experience, "ST." is usually written as "SAINT" instead - for example, St. Luke's Primary School is written as Saint Luke's Primary School. To address this, we'll replace such occurrences:
```{r}
resale$street_name <- gsub("ST\\.", "SAINT", resale$street_name)
```
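As a quick sanity check of the pattern, note that the `\\.` escapes the dot, so only the literal "ST." is replaced while street names containing a bare "ST" are left untouched (hypothetical street names below):

```{r}
# the escaped dot ensures "ST." matches literally; "ST" without a dot is kept
streets <- c("ST. GEORGE'S RD", "TAMPINES ST 11")
fixed <- gsub("ST\\.", "SAINT", streets)
fixed
```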
Subsequently, as our dataset is missing the coordinates of each location, I created a function, `geocode()`, which calls the OneMap API to retrieve the geometry of each location.
```{r}
geocode <- function(block, streetname) {
base_url <- "https://developers.onemap.sg/commonapi/search"
address <- paste(block, streetname, sep = " ")
query <- list("searchVal" = address,
"returnGeom" = "Y",
"getAddrDetails" = "N",
"pageNum" = "1")
res <- GET(base_url, query = query)
restext<-content(res, as="text")
output <- fromJSON(restext) %>%
as.data.frame %>%
select(results.LATITUDE, results.LONGITUDE)
return(output)
}
```
While exploring the [API](https://app.swaggerhub.com/apis/onemap-sg/new-onemap-api/1.0.4#/OneMap%20REST%20APIs/search), I realised that the best search parameter would be to combine the block column with its street_name. Thus, the function `geocode()` takes in both the block and street name.
```{r}
#| eval: false
resale$LATITUDE <- 0
resale$LONGITUDE <- 0
for (i in 1:nrow(resale)){
temp_output <- geocode(resale[i, 4], resale[i, 5])
resale$LATITUDE[i] <- temp_output$results.LATITUDE
resale$LONGITUDE[i] <- temp_output$results.LONGITUDE
}
```
To reuse the resale dataset without calling the API again, I saved it into an rds file.
```{r}
#| eval: false
write_rds(resale, "data/rds/resale.rds")
```
```{r}
resale_rds<-readRDS("data/rds/resale.rds")
```
#### Structural Data
In this section, I will be pre-processing the structural data that I need for my tasks.
##### Floor Level
Let's first take a look at the storey_range values.
```{r}
sort(unique(resale_rds$storey_range))
```
We can see that there are 17 storey range categories. The following code chunk recodes the categorical labels into numerical values, assigning 1 to the first range (01 TO 03) and 17 to the last range (49 TO 51).
```{r}
storey <- sort(unique(resale_rds$storey_range))
storey_order <- 1:length(storey)
storey_range_order <- data.frame(storey, storey_order)
storey_range_order
```
Next, I will join it to the original resale data frame.
```{r}
resale_rds <- left_join(resale_rds, storey_range_order, by = c("storey_range" = "storey"))
```
##### Remaining Lease
Currently, remaining_lease is in string format, but we need it to be numeric. Thus, we split the string into years and months and then replace the original values with the remaining lease calculated in years.
```{r}
#e.g. resale_rds$remaining_lease: "53 years 10 months"
str_list <- str_split(resale_rds$remaining_lease, " ")
#after splitting: ["53", "years", "10", "months"]
for (i in 1:length(str_list)){
if (length(unlist(str_list[i])) > 2) {
year <- as.numeric(unlist(str_list[i])[1])
month <- as.numeric(unlist(str_list[i])[3])
resale_rds$remaining_lease[i] <- year + round(month/12, 2)
}
else {
year <- as.numeric(unlist(str_list[i])[1])
resale_rds$remaining_lease[i] <- year
}
}
```
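The loop above can also be expressed without explicit iteration; below is a vectorised sketch (the helper name `lease_to_years` and the sample strings are my own):

```{r}
# vectorised alternative: parse "NN years" or "NN years MM months" strings
lease_to_years <- function(x) {
  yrs  <- as.numeric(sub(" years.*", "", x))
  mths <- ifelse(grepl("month", x),
                 as.numeric(sub(".*years (\\d+) month.*", "\\1", x)),
                 0)
  round(yrs + mths / 12, 2)
}
lease_to_years(c("53 years 10 months", "61 years"))
```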
##### Age of Unit
To get the age of the unit, I subtract the lease commencement year from the current year, i.e. 2023.
```{r}
str_list <- resale_rds$lease_commence_date
for (i in 1:length(str_list)){
resale_rds$lease_commence_date[i] <- 2023 -
resale_rds$lease_commence_date[i]
}
```
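Since R subtracts element-wise, the loop above is equivalent to a single vectorised subtraction; a small sketch with hypothetical lease commencement years:

```{r}
# age of unit = reference year (2023) minus lease commencement year
lease_commence <- c(1979, 1996, 2016)  # hypothetical values
age <- 2023 - lease_commence
age
```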
Finally, we can convert it into an sf object.
```{r}
resale_sf <- st_as_sf(resale_rds,
coords = c("LONGITUDE",
"LATITUDE"),
crs=4326) %>%
st_transform(crs = 3414)
```
### Data Pre-processing
Under this section, we will check for:
- missing values
- the correct coordinate system
- invalid geometries
#### Check for Missing Values
```{r}
#resale_sf
sum(is.na(resale_sf))
```
We can observe that we have no missing values.
#### Check for Correct Coordinates System
```{r}
#resale_sf
st_crs(resale_sf)
```
Our dataset has the correct Singapore coordinate system, EPSG: 3414.
#### Check for Invalid Geometries
```{r}
#resale_sf
length(which(st_is_valid(resale_sf) == FALSE))
```
Based on the output, our geometries are valid.
Once again, I saved the processed resale_sf in an rds file.
```{r}
#| eval: false
write_rds(resale_sf, "data/rds/resale_sf.rds")
```
## Geospatial Data
### Master Plan 2014 Subzone Boundary (Web)
The code chunk below imports the geospatial polygon data of Singapore's 2014 subzone boundaries.
```{r}
mpsz <- st_read(dsn="data/geospatial/MP14_SUBZONE_WEB_PL.kml") %>%
st_transform(crs = 3414)
```
### Locational Factors Extracted via onemapAPI token
For some of the locational factors, I will be utilising the OneMap API to retrieve their geometry data.
One method I discovered is to use a token to retrieve the geometry of locational factors based on their related theme. Before we can proceed, we need to register for an account [here](https://www.onemap.gov.sg/docs/#register-free-account) and retrieve the token.
```{r}
#| eval: false
token <- get_token("user@example.com", "password")
```
The code chunk below helps us view the available themes. As a token is needed, I first saved the output in an rds file.
```{r}
#| eval: false
#retrieve available themes that we can refer to
avail_themes <-search_themes(token)
write_rds(avail_themes, "data/rds/available_themes.rds")
```
After reading the saved file, I sorted the themes in alphabetical order for easier reference.
```{r}
#read the file for available themes
avail_themes<-readRDS("data/rds/available_themes.rds")
#sort by alphabetical order
avail_themes<-avail_themes[order(avail_themes$THEMENAME),]
avail_themes
```
Looking through the available themes from the OneMap API, the themes relevant to our tasks are:
| Theme Name | Query Name |
|---------------------------------------------------------------------------------------------------------------|------------------|
| Child Care Services | childcare |
| Eldercare Services | eldercare |
| Hawker Centres_New (Note: there are other similar themes such as Hawker Centres and Healthier Hawker Centres) | hawkercentre_new |
| Kindergartens | kindergartens |
| Parks (Note: there are other similar themes such as Parks\@SG and NParks Parks and Nature Reserves) | nationalparks |
For the following code chunks, I created a shapefile for each locational factor I am interested in.
The steps taken are:
1. Retrieve data such as the geometry and name of the place/amenity using onemapsgapi's get_theme() function, which takes in a theme (i.e. its query name), and store the data in a data frame
2. Convert the data frame to an sf object by using the st_as_sf() function
3. Transform crs information to [Singapore coordinates system](https://epsg.io/3414) i.e 3414 by using st_transform() function
4. Write to a shapefile using st_write() function
::: panel-tabset
## Childcare
``` r
#| eval: false
#theme: childcare
#retrieve the data such as the geometry and name accordingly to the theme
childcare_tibble <- get_theme(token, "childcare")
# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
childcare_sf <- st_as_sf(childcare_tibble, coords=c("Lng", "Lat"),
crs=4326) %>%
st_transform(crs = 3414)
st_write(obj = childcare_sf,
dsn = "data/geospatial",
layer = "childcare",
driver = "ESRI Shapefile",
append=FALSE)
```
## Eldercare
``` r
#| eval: false
#theme: eldercare
#retrieve the data such as the geometry and name based accordingly to the theme
eldercare_tibble <- get_theme(token, "eldercare")
# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
eldercare_sf <- st_as_sf(eldercare_tibble, coords=c("Lng", "Lat"),
crs=4326) %>%
st_transform(crs = 3414)
write_rds(eldercare_sf, "data/rds/eldercare.rds")
```
## Hawker Centres
``` r
#| eval: false
#theme: new hawker centres
#retrieve the data such as the geometry and name based accordingly to the theme
hawkercentre_new_tibble <- get_theme(token, "hawkercentre_new")
# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
hawkercentre_new_sf <- st_as_sf(hawkercentre_new_tibble, coords=c("Lng", "Lat"),
crs=4326) %>%
st_transform(crs = 3414)
write_rds(hawkercentre_new_sf, "data/rds/hawker_new.rds")
#issues writing in this manner
#st_write(obj = hawkercentre_new_sf,
# dsn = "data/geospatial",
# layer = "hawkercentre_new",
# driver = "ESRI Shapefile",
# append=FALSE)
```
## Kindergartens
``` r
#| eval: false
#theme: kindergartens
#retrieve the data such as the geometry and name based accordingly to the theme
kindergartens_tibble <- get_theme(token, "kindergartens")
# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
kindergartens_sf <- st_as_sf(kindergartens_tibble, coords=c("Lng", "Lat"),
crs=4326) %>%
st_transform(crs = 3414)
write_rds(kindergartens_sf, "data/rds/kindergartens.rds")
```
## Parks
``` r
#| eval: false
#theme: parks
#retrieve the data such as the geometry and name based accordingly to the theme
nationalparks_tibble <- get_theme(token, "nationalparks")
# to convert a data frame of coordinates to an sf object and transform the crs information and create a shapefile for it
nationalparks_sf <- st_as_sf(nationalparks_tibble, coords=c("Lng", "Lat"),
crs=4326) %>%
st_transform(crs = 3414)
write_rds(nationalparks_sf, "data/rds/nationalparks.rds")
```
:::
Import the following to be used for our tasks later.
```{r}
childcare_sf <-st_read("data/geospatial", layer="childcare")
eldercare_sf<- readRDS("data/rds/eldercare.rds")
hawkercentre_new_sf <- readRDS("data/rds/hawker_new.rds")
kindergartens_sf <- readRDS("data/rds/kindergartens.rds")
nationalparks_sf <- readRDS("data/rds/nationalparks.rds")
```
### CBD Area
For this assignment, I consider the CBD area to be in the Downtown Core, so I will be using the [coordinates of Downtown Core](https://www.latlong.net/place/downtown-core-singapore-20616.html).
```{r}
lat= c(1.287953)
lng= c(103.851784)
cbd_sf <- data.frame(lat, lng) %>%
st_as_sf(coords = c("lng", "lat"), crs=4326) %>%
st_transform(crs=3414)
```
### Supermarket
```{r}
supermarket_sf <- st_read("data/geospatial/supermarkets-geojson.geojson")
supermarket_sf <- supermarket_sf %>%
st_transform(crs = 3414)
```
### Bus Stop
```{r}
bus_stop<- st_read(dsn = "data/geospatial", layer = "BusStop")
bus_stop_sf <- bus_stop %>%
st_transform(crs = 3414)
```
### MRT/LRT
```{r}
mrt = st_read(dsn = "data/geospatial/", layer = "Train_Station_Exit_Layer")
mrt_sf <- mrt %>%
st_transform(crs = 3414)
```
### Primary School
While searching for a dataset of primary school locations, I came across this dataset from [data.gov.sg](https://data.gov.sg/) which has general information on schools in Singapore. By filtering `mainlevel_code`, which represents the type of school, for PRIMARY, I am able to extract the addresses and postal codes of primary schools in Singapore.
```{r}
primary_school <- read_csv("data/geospatial/general-information-of-schools.csv")
primary_school <- primary_school %>%
filter(mainlevel_code == "PRIMARY") %>%
select(school_name, address, postal_code)
glimpse(primary_school)
```
Based on the output, we can observe that there are 183 primary schools in Singapore. However, in 2022, Juying Primary School was merged with Pioneer Primary School and it can no longer be found via the API. Thus, I decided to remove it from our data.
```{r}
primary_school<-primary_school %>%
filter(school_name!='JUYING PRIMARY SCHOOL')
```
Once again, we can use the geocode() function we created earlier to extract the geometry of each primary school by its school name.
```{r}
#| eval: false
primary_school$LATITUDE <- 0
primary_school$LONGITUDE <- 0
for (i in 1:nrow(primary_school)){
temp_output <- geocode(primary_school[i, 1],"")
primary_school$LATITUDE[i] <- temp_output$results.LATITUDE
primary_school$LONGITUDE[i] <- temp_output$results.LONGITUDE
}
```
To reuse the primary school data without calling the API again, I saved it in an rds file.
```{r}
#| eval: false
write_rds(primary_school, "data/rds/primary_school.rds")
```
```{r}
primary_school_rds<-readRDS("data/rds/primary_school.rds")
```
Next, we can convert our data frame into an sf object.
```{r}
primary_school_sf <- st_as_sf(primary_school_rds,
coords = c("LONGITUDE",
"LATITUDE"),
crs=4326) %>%
st_transform(crs = 3414)
```
However, besides this, we need to determine what counts as a good primary school (which is necessary for one of our tasks). Based on the assumption that a good primary school is a popular one, I picked out the top 10 popular primary schools, referencing this [website](https://schoolbell.sg/primary-school-ranking/) for their popularity.
```{r}
popular_primary_schools <-c("Pei Hwa Presbyterian Primary School",
"Gongshang Primary School",
"Riverside Primary School",
"Red Swastika School",
"Punggol Green Primary School",
"Princess Elizabeth Primary School",
"Westwood Primary School",
"St. Hilda’s Primary School",
"Catholic High School (Primary Section)",
"Ai Tong School")
popular_primary_schools
```
Next, I need to check whether the top 10 most popular primary schools can be found in the primary school data I extracted earlier. But before that, for consistency, I used the lapply() function with toupper() to convert the school names picked out from the website to uppercase.
```{r}
#make school names all uppercase
popular_primary_schools <- lapply(popular_primary_schools, toupper)
# to check both primary school datasets matches
popular_primary_schools_sf <- primary_school_sf %>%
filter(school_name %in% popular_primary_schools)
```
```{r}
nrow(popular_primary_schools_sf)
```
Based on the output above, out of the 10 primary schools, only 8 can be found. The code chunk below tells us which primary schools are missing.
```{r}
unique(popular_primary_schools[!(popular_primary_schools %in% popular_primary_schools_sf$school_name)])
```
Looking further into our dataset, I found that the school names in the output above are written differently in the original primary school dataset. For example, CANOSSA CATHOLIC PRIMARY SCHOOL is the same as CATHOLIC HIGH SCHOOL (PRIMARY SECTION). Thus, I decided to use the rbind() function to manually add both records to popular_primary_schools_sf.
```{r}
popular_primary_schools_sf <- popular_primary_schools_sf %>%
rbind(primary_school_sf %>% filter(school_name == "CANOSSA CATHOLIC PRIMARY SCHOOL"))
popular_primary_schools_sf <- popular_primary_schools_sf %>%
rbind(primary_school_sf %>% filter(school_name == "ST. HILDA'S PRIMARY SCHOOL"))
```
```{r}
nrow(popular_primary_schools_sf)
```
### Shopping Mall
For shopping malls, I used the dataset extracted by [Valary Lim](https://github.com/ValaryLim/Mall-Coordinates-Web-Scraper), who web-scraped the shopping mall data from [Wikipedia](https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore) in 2019.
```{r}
shopping_mall <- read.csv("data/geospatial/mall_coordinates_updated.csv")
shopping_mall <- shopping_mall %>%
select(name, latitude, longitude)
glimpse(shopping_mall)
```
Taking a glimpse at our data, there are a total of 184 shopping malls in 2019.
Next, the code chunk below transforms our data into the correct Singapore coordinate system.
```{r}
shopping_mall_sf <- st_as_sf(shopping_mall,
coords = c("longitude",
"latitude"),
crs = 4326) %>%
st_transform(crs = 3414)
```
## Data Pre-processing
Same as before, we will conduct the following data pre-processing steps, with an additional step of removing irrelevant columns for certain datasets:
- remove irrelevant columns
- check for missing values
- check for the correct coordinate system
- check for invalid geometries
#### Remove Irrelevant Columns
For our datasets, we only need the name and the geometry, so in the following code chunk I will be removing the irrelevant columns.
```{r}
#mpsz
mpsz <- mpsz %>% select("Name")
#childcare_sf
#combine name and address postal to make it unique, some childcare have same name but diff locations
childcare_sf$full_address <- paste(childcare_sf$NAME, childcare_sf$ADDRESSP)
childcare_sf <- childcare_sf %>% select("full_address")
#eldercare_sf
eldercare_sf$full_address <- paste(eldercare_sf$NAME, eldercare_sf$ADDRESSPOSTALCODE)
eldercare_sf <- eldercare_sf %>% select("full_address")
#hawkercentre_new_sf
hawkercentre_new_sf <- hawkercentre_new_sf %>% select("NAME")
#kindergartens_sf
kindergartens_sf <- kindergartens_sf %>% select("NAME")
#nationalparks_sf
nationalparks_sf <- nationalparks_sf %>% select("NAME")
#supermarket
supermarket_sf <- supermarket_sf %>% select("Description")
#bus stop
bus_stop_sf$stop_name <- paste(bus_stop_sf$BUS_STOP_N, bus_stop_sf$BUS_ROOF_N, bus_stop_sf$LOC_DESC)
bus_stop_sf <- bus_stop_sf %>% select("stop_name")
#mrt
#combine stn name and exit to make each row unique
mrt_sf$stn <- paste(mrt_sf$stn_name, mrt_sf$exit_code)
mrt_sf <- mrt_sf %>% select("stn")
```
#### Check for Missing Values
```{r}
#mpsz
sum(is.na(mpsz))
#childcare_sf
sum(is.na(childcare_sf))
#eldercare_sf
sum(is.na(eldercare_sf))
#hawkercentre_new_sf
sum(is.na(hawkercentre_new_sf))
#kindergartens_sf
sum(is.na(kindergartens_sf))
#nationalparks_sf
sum(is.na(nationalparks_sf))
#supermarket_sf
sum(is.na(supermarket_sf))
#bus stop
sum(is.na(bus_stop_sf))
#MRT
sum(is.na(mrt_sf))
#shopping_mall_sf
sum(is.na(shopping_mall_sf))
#primary_school_sf
sum(is.na(primary_school_sf))
#popular_primary_schools_sf
sum(is.na(popular_primary_schools_sf))
```
Based on the output, there are no missing values.
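As a side note, repeating sum(is.na(...)) per layer can be condensed by looping over a named list; a base R sketch with toy data frames (in practice, each element would be one of the sf layers above, with st_drop_geometry() applied first):

```{r}
# compact pattern: apply the same NA check across several tables at once
layers <- list(a = data.frame(x = c(1, NA, 3)),
               b = data.frame(x = 1:3))
na_counts <- sapply(layers, function(df) sum(is.na(df)))
na_counts
```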
#### Check for Correct Coordinates System
```{r}
#mpsz
st_crs(mpsz)
#childcare_sf
st_crs(childcare_sf)
#eldercare_sf
st_crs(eldercare_sf)
#hawkercentre_new_sf
st_crs(hawkercentre_new_sf)
#kindergartens_sf
st_crs(kindergartens_sf)
#nationalparks_sf
st_crs(nationalparks_sf)
#supermarket_sf
st_crs(supermarket_sf)
#bus_stop_sf
st_crs(bus_stop_sf)
#mrt_sf
st_crs(mrt_sf)
#shopping_mall_sf
st_crs(shopping_mall_sf)
#primary_school_sf
st_crs(primary_school_sf)
#popular_primary_schools_sf
st_crs(popular_primary_schools_sf)
```
Based on the output, our datasets are in the correct coordinate systems.
#### Check for Invalid Geometries
```{r}
#mpsz
length(which(st_is_valid(mpsz) == FALSE))
#childcare_sf
length(which(st_is_valid(childcare_sf) == FALSE))
#eldercare_sf
length(which(st_is_valid(eldercare_sf) == FALSE))
#kindergartens_sf
length(which(st_is_valid(kindergartens_sf) == FALSE))
#hawkercentre_new_sf
length(which(st_is_valid(hawkercentre_new_sf) == FALSE))
#nationalparks_sf
length(which(st_is_valid(nationalparks_sf) == FALSE))
#supermarket_sf
length(which(st_is_valid(supermarket_sf) == FALSE))
#bus_stop_sf
length(which(st_is_valid(bus_stop_sf) == FALSE))
#mrt_sf
length(which(st_is_valid(mrt_sf) == FALSE))
#shopping_mall_sf
length(which(st_is_valid(shopping_mall_sf) == FALSE))
#primary_school_sf
length(which(st_is_valid(primary_school_sf) == FALSE))
#popular_primary_schools_sf
length(which(st_is_valid(popular_primary_schools_sf) == FALSE))
```
Based on the output, we can see that only the mpsz dataset has invalid geometries. The code chunk below addresses this issue.
```{r}
mpsz <- st_make_valid(mpsz)
length(which(st_is_valid(mpsz) == FALSE))
```
#### Visualisations
The following code chunks shows some visualisations for the data we have just preprocessed.
##### **Subzone Boundary of Singapore 2014**
```{r}
plot(st_geometry(mpsz))
```
##### **MRT/LRT Stations Map**
```{r}
tmap_mode("view")
tm_shape(mrt_sf) +
tm_dots(col="red", size=0.05) +
tm_view(set.zoom.limits = c(11, 14))
```
##### **Bus Map**
```{r}
tmap_mode("plot")
tm_shape(mpsz) +
tm_borders(alpha = 0.5) +
tmap_options(check.and.fix = TRUE) +
tm_shape(bus_stop_sf) +
tm_dots(col="red", size=0.05) +
tm_layout(main.title = "Bus Stops",
main.title.position = "center",
main.title.size = 1.2,
frame = TRUE)
```
##### **Education/Healthcare related map**
The code chunk below shows the location of:
- childcare centres - blue dots
- eldercare centres - red dots
- kindergartens - orange dots
- primary schools - black dots
- top 10 popular primary schools - grey dots
```{r}
tmap_mode("view")
tm_shape(childcare_sf) +
tm_dots(alpha=0.5, #affects transparency of points
col="#2ec4b6",
size=0.05) +
tm_shape(eldercare_sf) +
tm_dots(alpha=0.5,
col="#e71d36",
size=0.05) +
tm_shape(kindergartens_sf) +
tm_dots(alpha=0.5,
col="#ff9f1c",
size=0.05) +
tm_shape(primary_school_sf) +
tm_dots(alpha=0.5,
col="#011627",
size=0.05) +
tm_shape(popular_primary_schools_sf) +
tm_dots(alpha=0.5,
col="grey",
size=0.05) +
tm_view(set.zoom.limits = c(11, 14))
```
One thing we can observe is that there are more childcare centres as compared to the rest.
##### Amenities Related Map
The code chunk below shows the location of:
- supermarkets - red dots
- shopping malls - orange dots
- national parks - dark green dots
- hawker centres - blue dots
```{r}
tmap_mode("view")
tm_shape(supermarket_sf) +
tm_dots(alpha=0.5, #affects transparency of points
col="#d62828",
size=0.05) +
tm_shape(shopping_mall_sf) +
tm_dots(alpha=0.5,
col="#f77f00",
size=0.05) +
tm_shape(nationalparks_sf) +
tm_dots(alpha=0.5,
col="#023020",
size=0.05) +
tm_shape(hawkercentre_new_sf) +
tm_dots(alpha=0.5,
col="#ADD8E6",
size=0.05) +
tm_view(set.zoom.limits = c(10, 14))
```
##### Resale Flat Prices
```{r}
tmap_mode("view")
tmap_options(check.and.fix = TRUE)
tm_shape(resale_sf) +
tm_dots(col = "resale_price",
alpha = 0.6,
style="quantile") +
# sets minimum zoom level to 11, sets maximum zoom level to 14
tm_view(set.zoom.limits = c(11,14))
```
We can observe that the majority of the 5-room HDB flats transacted between October 2022 and February 2023 are found on the eastern side of Singapore, and that the higher-priced flats tend to be on the south-eastern side of Singapore.
```{r}
tmap_mode("plot")
```
## Formulated Functions
### Proximity Distance Calculations
As per our task, we need to find the proximity to certain facilities, i.e. proximity to the CBD, eldercare, hawker centres, MRT, parks, good primary schools, shopping malls and supermarkets. This can be computed by creating a function called proximity(), which uses st_distance() to build a distance matrix and the rowMins() function from the matrixStats package to find the closest facility (shortest distance). The values are then appended to resale_sf as a new column.
```{r}
proximity <- function(df1, df2, varname) {
dist_matrix <- st_distance(df1, df2) %>%
drop_units()
df1[,varname] <- rowMins(dist_matrix)
return(df1)
}
```
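To illustrate what the function computes, here is a toy example in base R with hypothetical coordinates: the nearest-facility distance is simply the row-wise minimum of the flats-by-facilities distance matrix.

```{r}
flats      <- matrix(c(0, 0, 10, 10), ncol = 2, byrow = TRUE)  # 2 flats
facilities <- matrix(c(3, 4, 10, 11), ncol = 2, byrow = TRUE)  # 2 facilities
# cross-distances between each flat (rows) and each facility (columns)
d <- as.matrix(dist(rbind(flats, facilities)))[1:2, 3:4]
nearest <- apply(d, 1, min)  # nearest-facility distance per flat
nearest
```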
```{r}
resale_sf<-readRDS("data/rds/resale_sf.rds")
```
```{r}
#CBD, eldercare, hawker centres, MRT, park, good primary school, shopping mall and supermarket.
resale_sf <-
proximity(resale_sf, cbd_sf, "PROX_CBD") %>%
proximity(., eldercare_sf, "PROX_ELDERCARE") %>%
proximity(., hawkercentre_new_sf, "PROX_HAWKER") %>%
proximity(., mrt_sf, "PROX_MRT") %>%
proximity(., nationalparks_sf, "PROX_PARK") %>%
proximity(., popular_primary_schools_sf, "PROX_GDPRISCH") %>%
proximity(., shopping_mall_sf, "PROX_SHOPPINGMALL") %>%
proximity(., supermarket_sf, "PROX_SPRMKT")
```
### Facility Count within Radius Calculations
As per our task, we also want to find the number of facilities within a particular radius. Thus, we create another function called num_radius(), which uses st_distance() to compute the distance between the flats and the desired facilities, then counts the observations within the radius with rowSums(). The values are appended to the data frame as a new column.
```{r}
num_radius <- function(df1, df2, varname, radius) {
dist_matrix <- st_distance(df1, df2) %>%
drop_units() %>%
as.data.frame()
df1[,varname] <- rowSums(dist_matrix <= radius)
return(df1)
}
```
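Similarly, the counting step can be checked on a toy distance matrix (hypothetical distances in metres): comparing against the radius gives a logical matrix, and rowSums() counts the TRUEs per flat.

```{r}
d <- matrix(c(120, 500, 340,
               90, 800, 360), nrow = 2, byrow = TRUE)  # 2 flats x 3 facilities
n_within <- rowSums(d <= 350)  # facilities within 350m of each flat
n_within
```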
```{r}
#Numbers of kindergartens within 350m, Numbers of childcare centres within 350m, Numbers of bus stop within 350m, Numbers of primary school within 1km
resale_sf <-
num_radius(resale_sf, kindergartens_sf, "NUM_KNDRGTN", 350) %>%
num_radius(., childcare_sf, "NUM_CHILDCARE", 350) %>%
num_radius(., bus_stop_sf, "NUM_BUS_STOP", 350) %>%
num_radius(., primary_school_sf, "NUM_PRISCH", 1000)
```
Once again, I save the latest version of resale_sf to an rds file.
```{r}
#| eval: false
write_rds(resale_sf, "data/rds/resale_sf_complete.rds")
```
```{r}
glimpse(readRDS("data/rds/resale_sf_complete.rds"))
```
# Train-Test Split
As earlier mentioned, for the resale dataset,
- Training dataset period: October 2022 to December 2022
- Test dataset period: January 2023 to February 2023
Hence, the code chunk below reads the resale_sf_complete rds file we created earlier and splits the dataset accordingly.
```{r}
resale_sf_complete<-readRDS("data/rds/resale_sf_complete.rds")
```
```{r}
resale_train <- resale_sf_complete %>%
filter(month >= "2022-10" & month <= "2022-12",
flat_type == "5 ROOM")
```
```{r}
resale_test <- resale_sf_complete %>%
filter(month >= "2023-01" & month <= "2023-02",
flat_type == "5 ROOM")
```
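A quick aside on why the filters above behave correctly: month is stored as a "YYYY-MM" character string, and because both components are zero-padded, lexicographic (string) comparison matches chronological order. A small base-R check:

```r
months <- c("2022-09", "2022-10", "2022-11", "2022-12", "2023-01", "2023-02")

# String comparison on zero-padded "YYYY-MM" follows calendar order
in_train <- months >= "2022-10" & months <= "2022-12"
in_test  <- months >= "2023-01" & months <= "2023-02"

months[in_train]  # "2022-10" "2022-11" "2022-12"
months[in_test]   # "2023-01" "2023-02"
```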
For future reference, I will write the train and test datasets of resale_sf as rds files.
```{r}
#| eval: false
write_rds(resale_train, "data/model/resale_train.rds")
write_rds(resale_test, "data/model/resale_test.rds")
```
The code chunk below reads our train and test dataset of resale_sf.
```{r}
train_data <- read_rds("data/model/resale_train.rds")
test_data <- read_rds("data/model/resale_test.rds")
```
```{r}
glimpse(train_data)
```
Having a glimpse at our training and test data, I realised that some columns are irrelevant for our subsequent tasks, such as flat_model, town, flat_type, block and street_name, so I removed them.
```{r}
train_data <- subset(train_data, select = -c(month, flat_model, town, flat_type, block, street_name, storey_range))
test_data <- subset(test_data, select = -c(month, flat_model, town, flat_type, block, street_name, storey_range))
```
The lease_commence_date column now holds the age of the unit that we calculated earlier, but its name is misleading, so I renamed it to age.
```{r}
colnames(train_data)[2] <- "age"
colnames(test_data)[2] <- "age"
```
Looking at our data again, age and remaining_lease are stored as characters but should be numeric, so I converted them accordingly.
```{r}
train_data$age <- as.numeric(train_data$age)
test_data$age <- as.numeric(test_data$age)
train_data$remaining_lease <- as.numeric(train_data$remaining_lease)
test_data$remaining_lease <- as.numeric(test_data$remaining_lease)
```
Finally, we are done processing our train and test data for our subsequent tasks!
The code chunk below shows a summary of our resale prices.
```{r}
summary(train_data$resale_price)
```
# Computing Correlation Matrix
Before loading the predictors into a predictive model, it is good practice to examine a correlation matrix for signs of multicollinearity.
But before that, we need to drop our geometry values.
```{r}
train_nogeo <- train_data %>%
st_drop_geometry()
```
The code chunk below saves the rds file for training data without geometry so we can use it later.
```{r}
#| eval: false
write_rds(train_nogeo, "data/model/train_nogeo.rds")
```
The code chunk below computes the correlation matrix.
```{r}
corrplot::corrplot(cor(train_nogeo),
diag = FALSE,
order = "AOE",
tl.pos = "td",
tl.cex = 0.5,
method = "number",
type = "upper")
```
In [general](https://blog.clairvoyantsoft.com/correlation-and-collinearity-how-they-can-make-or-break-a-model-9135fbe6936a), an absolute correlation coefficient of \>0.7 between two or more predictors indicates the presence of multicollinearity. With this many variables the individual values are hard to read, but most look acceptable, except that the correlation between age and remaining_lease is -1. This means the age of the unit and the remaining lease are perfectly collinear, so we should remove one of them before fitting our linear regression model.
The code chunk below removes the age of unit from our dataset.
```{r}
#removes age column
train_nogeo<-train_nogeo[,-2]
```
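Rather than eyeballing the plot, highly correlated pairs can also be flagged programmatically. The sketch below uses only base R, with the built-in mtcars dataset standing in for train_nogeo, to list every pair with an absolute correlation above 0.7:

```r
# Flag predictor pairs with |correlation| > 0.7 (mtcars as a stand-in)
cor_mat <- cor(mtcars)
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA  # keep each pair only once

high <- which(abs(cor_mat) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cor_mat)[high[, 1]],
                         var2 = colnames(cor_mat)[high[, 2]],
                         r    = round(cor_mat[high], 2))
high_pairs
```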
Another way to check for multicollinearity is the Variance Inflation Factor (VIF), which we will compute later.
# Multiple Linear Regression Model (OLS)
First, let's build our multiple linear regression model, where y is resale_price and the x's are the independent predictors.
```{r}
mlr1 <- lm(resale_price ~. ,
data = train_nogeo)
summary(mlr1)
```
Based on the output, we can infer that the adjusted R square is 0.7054, which indicates that roughly 70% of the variance of the dependent variable is explained by the independent variables. Most of the predictors are significant (p-value \< 0.05), except for PROX_SPRMKT, PROX_SHOPPINGMALL, PROX_GDPRISCH and NUM_BUS_STOP.
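As a reminder of where the adjusted R-square figure comes from: it rescales R-square by the number of predictors, adj R² = 1 − (1 − R²)(n − 1)/(n − k − 1). A base-R check on a built-in dataset (the model here is illustrative, not ours):

```r
# Verify the adjusted R-square formula against summary.lm()
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

n <- nrow(mtcars)
k <- 2  # number of predictors in the model above
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)

all.equal(adj_manual, s$adj.r.squared)  # TRUE
```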
## Test of multicollinearity (VIF)
We can use the ols_vif_tol() function of the olsrr package to calculate VIF. As a rule of thumb, a VIF below 5 suggests no problematic correlation among predictors.
```{r}
ols_vif_tol(mlr1)
```
The higher the VIF, the less reliable our regression results. Based on the output, there are no signs of multicollinearity as all VIF values are below 5.
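For intuition, VIF can also be computed by hand: regress each predictor on all the others and take VIF = 1 / (1 − R²). The base-R sketch below does this on mtcars, which again stands in for our training data; the predictor subset chosen is arbitrary.

```r
# VIF by hand: regress each predictor on the others, VIF = 1 / (1 - R^2)
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]

vif_manual <- sapply(names(predictors), function(v) {
  fit <- lm(reformulate(setdiff(names(predictors), v), response = v),
            data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
round(vif_manual, 2)  # values > 5 would suggest multicollinearity
```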
The code chunk below saves the results of our model to an rds file.
```{r}
#| eval: false
write_rds(mlr1, "data/model/mlr1.rds" )
```
# GWR Predictive Method
In this section, we will calibrate a geographically weighted regression (GWR) model to predict HDB resale prices.
## Converting the sf data.frame to SpatialPointDataFrame
```{r}
train_data_sp <- as_Spatial(train_data[,-2])
train_data_sp
```
## Computing adaptive bandwidth
```{r}
#| eval: false
bw_adaptive <- bw.gwr(resale_price ~.,
data = train_data_sp,
approach="CV",
kernel="gaussian",
adaptive=TRUE,
longlat=FALSE
)
```
Doing cross-validation with a Gaussian kernel, the smallest CV score is 3.054284e+12, achieved at an adaptive bandwidth of 22.
Next, we will save the adaptive bandwidth as an rds file.
```{r}
#| eval: false
write_rds(bw_adaptive, file = "data/model/bw_adaptive.rds")
```
## Constructing the adaptive bandwidth GWR model
First, let's call the saved bandwidth by using the code chunk below.
```{r}
bw_adaptive <- read_rds("data/model/bw_adaptive.rds")
```
Now, we can go ahead to calibrate the gwr-based hedonic pricing model by using adaptive bandwidth and Gaussian kernel as shown in the code chunk below.
```{r}
#| eval: false
gwr_adaptive <- gwr.basic(formula = resale_price ~.,
data=train_data_sp,
bw=bw_adaptive,
kernel = 'gaussian',
adaptive=TRUE,
longlat = FALSE)
```
The code chunk below will be used to save the model in rds format for future use.
```{r}
#| eval: false
write_rds(gwr_adaptive, "data/model/gwr_adaptive.rds")
```
The code below can be used to display the model output.
```{r}
gwr_adaptive <- read_rds("data/model/gwr_adaptive.rds")
gwr_adaptive
```
From the output above, we can observe that the GWR adjusted R-square is 0.9243233, which is higher than the non-spatial multiple linear regression model's adjusted R-square of 0.709. Based on [research](https://builtin.com/data-science/adjusted-r-squared#:~:text=R%2Dsquared%20measures%20the%20goodness,is%20not%20a%20good%20fit.), R-squared measures the goodness of fit of a regression model: a higher value indicates a better fit, which suggests the GWR model fits the data better.
The GWR model also has a lower AIC (35569.1) than the regression model's (37841.04). Based on [research](https://www.scribbr.com/statistics/akaike-information-criterion/#:~:text=Lower%20AIC%20scores%20are%20better,be%20the%20better%2Dfit%20model.), lower AIC scores are better, as AIC penalises models that use more parameters.
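The AIC values reported above follow the usual definition AIC = −2·logLik + 2·df, which can be verified directly in base R (the model here is illustrative, not ours):

```r
# Verify AIC = -2 * log-likelihood + 2 * number of estimated parameters
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(fit)
aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")

all.equal(aic_manual, AIC(fit))  # TRUE
```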
# Preparing coordinates data
## Extracting coordinates data
The code chunk below extracts the x, y coordinates of the training and test data sets.
```{r}
coords_train <- st_coordinates(train_data)
coords_test <- st_coordinates(test_data)
```
Before continuing, we write the outputs to rds for future use.
```{r}
#| eval: false
coords_train <- write_rds(coords_train, "data/model/coords_train.rds" )
coords_test <- write_rds(coords_test, "data/model/coords_test.rds" )
```
Now, we need to retrieve our training data, which does not have the geometry.
```{r}
train_nogeo_rds<-readRDS("data/model/train_nogeo.rds")
```
# Calibrating Random Forest Model
Now, we will calibrate a model to predict HDB resale price by using the random forest function of the ranger package. set.seed() is used to keep the results reproducible, since random forest involves random sampling. Based on [research](https://medium.com/@raj5287/effects-of-multi-collinearity-in-logistic-regression-svm-rf-af6766d91f1b#:~:text=Random%20Forest%20uses%20bootstrap%20sampling,different%20set%20of%20data%20points.), random forest uses bootstrap sampling and feature sampling (row and column sampling), so it is not much affected by multicollinearity: each tree picks a different set of features and sees a different set of data points. This means I do not need to remove the age variable.
```{r}
set.seed(1234)
rf <- ranger(resale_price ~.,
data=train_nogeo_rds,
num.trees = 30)
```
```{r}
print(rf)
```
Based on the output, for a random forest of 30 trees, the OOB prediction error (MSE) is 3246290896 and the OOB R squared is 0.830922, which implies that around 83% of the variance in the training data is explained by the model.
```{r}
sqrt(rf$prediction.error)
```
The random forest RMSE is 56976.23.
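For clarity, that RMSE is simply the square root of the OOB MSE reported by ranger above, converting the error back into the units of resale_price (SGD):

```r
oob_mse <- 3246290896   # OOB prediction error (MSE) reported by ranger above

# RMSE = sqrt(MSE), in the original units of the response
round(sqrt(oob_mse), 2)
# 56976.23
```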
# Calibrating a Geographically Weighted Random Forest
Next, we would like to calibrate a Geographically Weighted Random Forest by using the grf.bw() function. As grf.bw() is computationally expensive, I will run it on a subset of the data (500 rows). Even though the bandwidth obtained from 500 rows may differ considerably from the one obtained on the whole dataset, approximating it on a subset is good enough for this assignment as a proof of concept.
```{r}
#| eval: false
gwRF_bw <- grf.bw(formula = resale_price ~
age +
remaining_lease +
floor_area_sqm +
storey_order +
PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT +
PROX_SHOPPINGMALL +
NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH,
                   data = train_nogeo_rds[1:500,],
kernel = "adaptive",
trees = 30,
                   coords = coords_train[1:500,]
)
```
As the computation time is too long, I will use what has been generated so far. Based on the output, the best bandwidth is 85, with an R2 of 0.761592272965718.
Next, I used the bandwidth discovered previously to calibrate our GWRF model. Afterwards, I saved it into an rds file.
```{r}
#| eval: false
set.seed(1234)
gwRF_adaptive <- grf(resale_price ~
age +
remaining_lease +
floor_area_sqm + storey_order +
PROX_CBD + PROX_ELDERCARE + PROX_HAWKER +
PROX_MRT + PROX_PARK + PROX_GDPRISCH + PROX_SPRMKT +
PROX_SHOPPINGMALL +
NUM_KNDRGTN + NUM_CHILDCARE + NUM_BUS_STOP + NUM_PRISCH,
dframe=train_nogeo_rds,
bw= 85,
ntree = 30,
kernel="adaptive",
coords=coords_train)
write_rds(gwRF_adaptive, "data/model/gwRF_adaptive.rds")
```
```{r}
gwRF_adaptive <- read_rds("data/model/gwRF_adaptive.rds")
```
```{r}
#look at the local variable importance
glimpse(gwRF_adaptive$Local.Variable.Importance)
```
```{r}
gwRF_adaptive$Global.Model
```
The output shows that the global model's R2 is 0.8331023 for a GWRF model of 30 trees and an adaptive bandwidth of 85.
# Predicting by using Test Data
## Multiple Linear Regression (OLS)
The code chunk below uses the predict.lm() function of the stats package to run inference on our test data and save the predictions in rds format.
```{r}
#| eval: false
ols_pred <- predict.lm(mlr1, test_data[,-2]) %>%
write_rds("data/model/ols_pred.rds")
```
## Random Forest
The code chunk below uses the predict() function of the ranger package to run inference on our test data and save the predictions in rds format.
```{r}
#| eval: false
test_nogeo <-test_data %>%
st_drop_geometry()
rf_pred <- predict(rf, test_nogeo) %>%
write_rds("data/model/rf_pred.rds")
```
## GWRF model
The code chunk below will be used to combine the test data with its corresponding coordinates data.
```{r}
test_data <- cbind(test_data, coords_test) %>%
st_drop_geometry()
```
Next, predict.grf() of the SpatialML package will be used to predict the resale value using the test data and the gwRF_adaptive model calibrated earlier.
```{r}
#| eval: false
set.seed(1234)
gwRF_pred <- predict.grf(gwRF_adaptive,
test_data,
x.var.name="X",
y.var.name="Y",
local.w=1,
global.w=0)
```
Before moving on, let us save the output to an rds file for future use.
```{r}
#| eval: false
GRF_pred <- write_rds(gwRF_pred, "data/model/GRF_pred.rds")
```
The output of predict.grf() is a vector of predicted values. It is better to convert it into a data frame for further visualisation and analysis.
```{r}
GRF_pred <- read_rds("data/model/GRF_pred.rds")
GRF_pred_df <- as.data.frame(GRF_pred)
```
In the code chunk below, cbind() is used to append the predicted values of the GRF model onto test_data.
```{r}
#for gwrf
test_data_p <- cbind(test_data, GRF_pred_df)
```
```{r}
#| eval: false
write_rds(test_data_p, "data/model/test_data_p.rds")
```
## Models Evaluation
To evaluate the 3 models, we combine the actual resale price and the predicted prices of each model into a single data frame.
```{r}
ols_pred_df <- read_rds("data/model/ols_pred.rds") %>%
as.data.frame()
colnames(ols_pred_df) <- c("ols_pred")
rf_pred_df <- read_rds("data/model/rf_pred.rds")$predictions %>%
as.data.frame()
colnames(rf_pred_df) <- c("rf_pred")
gwRF_pred_df <- GRF_pred %>%
as.data.frame()
colnames(gwRF_pred_df) <- c("gwrf_pred")
prices_pred_df <- cbind(test_data$resale_price, ols_pred_df, rf_pred_df,
gwRF_pred_df) %>%
rename("actual_price" = "test_data$resale_price")
head(prices_pred_df)
```
Looking at the results, the GWRF model's predicted values appear closest to the actual prices compared with the other models.
# Calculating Root Mean Square Error
The root mean square error (RMSE) allows us to measure how far predicted values are from observed values in a regression analysis.
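The formula is RMSE = sqrt(mean((actual − predicted)²)). A small helper makes this explicit; the numbers in the usage example are made up purely for illustration:

```r
# RMSE: square root of the mean squared difference between
# actual and predicted values
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Toy illustration (values invented): each prediction is off by 10,
# so the RMSE is exactly 10
rmse_manual(c(100, 200, 300), c(110, 190, 310))
# 10
```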
## Multiple Linear Regression (OLS)
```{r}
sqrt(mean((prices_pred_df$actual_price - prices_pred_df$ols_pred)^2))
```
## Random Forest
```{r}
sqrt(mean((prices_pred_df$actual_price - prices_pred_df$rf_pred)^2))
```
## GWRF model
In the code chunk below, rmse() of Metrics package is used to compute the RMSE for GRF model.
```{r}
rmse(test_data_p$resale_price,
     test_data_p$GRF_pred)
```
Comparing the 3 models' RMSE, the GWRF model has the lowest RMSE of 53185.72, compared to the OLS model's 69388.4 and the random forest model's 55187.82. [RMSE](https://help.sap.com/docs/SAP_PREDICTIVE_ANALYTICS/41d1a6d4e7574e32b815f1cc87c00f42/5e5198fd4afe4ae5b48fefe0d3161810.html#:~:text=The%20Root%20Mean%20Squared%20Error%20(RMSE)%20is%20one%20of%20the,the%20target%20value%20(accuracy).) measures the average difference between values predicted by a model and the actual values, providing an estimate of how accurately the model predicts the target. Hence, among the 3 models, the GWRF model is the best at predicting resale price.
# Visualising the Predicted Values
```{r}
test_data_models <- cbind(test_data, prices_pred_df)
gwr_plot <- ggplot(data = test_data_p,
aes(x = GRF_pred,
y = resale_price)) +
geom_point()+
geom_abline(col = "Red")
lm_plot <- ggplot(data = test_data_models,
aes(x = ols_pred,
y = actual_price)) +
geom_point()+
geom_abline(col = "Red")
rf_plot <- ggplot(data = test_data_models,
aes(x = rf_pred,
y = actual_price)) +
geom_point()+
geom_abline(col = "Red")
plot_grid(gwr_plot, lm_plot, rf_plot, labels = "AUTO")
```
Comparing the 3 graphs, the GWRF model's points lie closer to the diagonal, whereas for the OLS and random forest models the majority of points on the right half sit above the red line. This again suggests that the GWRF model outperforms the OLS and random forest models.
# Conclusion
In conclusion, by comparing the 3 models' performance at predicting the actual resale prices, both via RMSE and via the scatterplots, we can determine that the GWRF model is the best model at predicting the actual resale prices. One reason it performs better is that [GWR](https://www.publichealth.columbia.edu/research/population-health-methods/geographically-weighted-regression#:~:text=outcome%20of%20interest.-,Description,variables%20to%20vary%20by%20locality.) is an outgrowth of ordinary least squares regression (OLS) and adds a level of modeling sophistication by allowing the relationships between the independent and dependent variables to vary by locality.
# Acknowledgment
To conclude, I would like to thank Prof. Kam for our IS415 Geospatial Analytics and Applications course materials & resources. I would also like to thank my seniors, [Megan Sim](https://is415-msty.netlify.app/posts/2021-10-25-take-home-exercise-3/?panelset=base&panelset1=extracted2&panelset4=mpsz&panelset5=recreational%252Flifestyle#proximity-distance-calculation) and [Nor Aisyah Binte Ajit](https://aisyahajit2018-is415.netlify.app/posts/2021-11-07-take-home-exercise-3/) as I have referenced their codes.